This is a draft document that was built and uploaded automatically. It may document beta software and be incomplete or even incorrect. Use this document at your own risk.

Operations Guide CLM
SUSE OpenStack Cloud 9

At the time of the SUSE OpenStack Cloud 9 release, this guide contains information pertaining to the operation, administration, and user functions of SUSE OpenStack Cloud. The audience is the admin-level operator of the cloud.

Publication Date: 02 Nov 2022

Copyright © 2006–2022 SUSE LLC and contributors. All rights reserved.

Except where otherwise noted, this document is licensed under Creative Commons Attribution 3.0 License: https://creativecommons.org/licenses/by/3.0/legalcode.

For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.

1 Operations Overview

A high-level overview of the processes related to operating a SUSE OpenStack Cloud 9 cloud.

1.1 What is a cloud operator?

When we talk about a cloud operator it is important to understand the scope of the tasks and responsibilities we are referring to. SUSE OpenStack Cloud defines a cloud operator as the person or group of people who will be administering the cloud infrastructure, which includes:

  • Monitoring the cloud infrastructure, resolving issues as they arise.

  • Managing hardware resources, adding/removing hardware due to capacity needs.

  • Repairing hardware issues and, if needed, recovering from them.

  • Performing domain administration tasks, which involves creating and managing projects, users, and groups as well as setting and managing resource quotas.

1.2 Tools provided to operate your cloud

SUSE OpenStack Cloud provides the following tools which are available to operate your cloud:

Operations Console

The Operations Console, often referred to as the Ops Console, is a web-based graphical user interface (GUI) that you can use to view data about your cloud infrastructure and make sure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can manage data in the following ways:

  • Triage alarm notifications in the central dashboard

  • Monitor the environment by giving priority to alarms that take precedence

  • Manage compute nodes and easily use a form to create a new host

  • Refine the monitoring environment by creating new alarms to specify a combination of metrics, services, and hosts that match the triggers unique to an environment

  • Plan for future storage by tracking capacity over time to predict with some degree of reliability the amount of additional storage needed

Dashboard

The Dashboard, often referred to as horizon or the horizon dashboard, is a web-based graphical user interface (GUI) that you can use to manage resources on a domain and project level. The following are some of the typical operational tasks that you may perform using the dashboard:

  • Creating and managing projects, users, and groups within your domain.

  • Assigning roles to users and groups to manage access to resources.

  • Setting and updating resource quotas for the projects.

For more details, see the following page: Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”

Command-line interface (CLI)

The OpenStack community has created a unified client, called the openstackclient (OSC), which combines the available commands in the various service-specific clients into one tool. Some service-specific commands do not have OSC equivalents.

You will find processes defined in our documentation that use these command-line tools. We have also outlined a list of common cloud administration tasks that you can perform with the command-line tools.

There are references throughout the SUSE OpenStack Cloud documentation to the HPE Smart Storage Administrator (HPE SSA) CLI. HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

1.3 Daily tasks

  • Ensure your cloud is running correctly: SUSE OpenStack Cloud is deployed as a set of highly available services to minimize the impact of failures. That said, hardware and software systems can fail. Detection of failures early in the process will enable you to address issues before they affect the broader system. SUSE OpenStack Cloud provides a monitoring solution, based on OpenStack’s monasca, which provides monitoring and metrics for all OpenStack components and much of the underlying system, including service status, performance metrics, compute node, and virtual machine status. Failures are exposed via the Operations Console and/or alarm notifications. In the case where more detailed diagnostics are required, you can use a centralized logging system based on the Elasticsearch, Logstash, and Kibana (ELK) stack. This provides the ability to search service logs to get detailed information on behavior and errors.

  • Perform critical maintenance: To ensure your OpenStack installation is running correctly, provides the right access and functionality, and is secure, you should make ongoing adjustments to the environment. Examples of daily maintenance tasks include:

    • Add/remove projects and users. The frequency of this task depends on your policy.

    • Apply security patches (if released).

    • Run daily backups.

1.4 Weekly or monthly tasks

  • Do regular capacity planning: Your initial deployment will likely reflect the known near to mid-term scale requirements, but at some point your needs will outgrow your initial deployment’s capacity. You can expand SUSE OpenStack Cloud in a variety of ways, such as by adding compute and storage capacity.

To manage your cloud’s capacity, begin by determining the load on the existing system. OpenStack is a set of relatively independent components and services, so there are multiple subsystems that can affect capacity. These include control plane nodes, compute nodes, object storage nodes, block storage nodes, and an image management system. At the most basic level, you should look at the CPU used, RAM used, I/O load, and the disk space used relative to the amounts available. For compute nodes, you can also evaluate the allocation of resources to hosted virtual machines. This information can be viewed in the Operations Console. You can pull historical information from the monitoring service (OpenStack’s monasca) by using its client or API. OpenStack also gives you some ability to manage the hosted resource utilization by using quotas for projects. You can track this usage over time to get your growth trend so that you can project when you will need to add capacity.
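
As a rough illustration of projecting a growth trend, the following sketch estimates how many days remain before a resource fills up. It assumes linear growth between the first and last sample. The function name days_until_full and the input format (one "day used" pair per line) are hypothetical and not part of SUSE OpenStack Cloud; in practice the samples would come from the monitoring service.

```shell
# Hypothetical helper: estimate the days left until a resource is full,
# assuming linear growth between the first and the last sample.
# stdin: lines of "<day_number> <used_amount>"; $1: total capacity.
days_until_full() {
    awk -v cap="$1" '
        NR == 1 { d0 = $1; u0 = $2 }     # remember the first sample
        { d1 = $1; u1 = $2 }             # keep overwriting: ends as the last sample
        END {
            if (NR < 2 || d1 == d0) { print "insufficient-data"; exit }
            rate = (u1 - u0) / (d1 - d0) # average growth per day
            if (rate <= 0) { print "no-growth"; exit }
            printf "%d\n", (cap - u1) / rate
        }'
}

# Example: 100 GB used on day 0, 200 GB on day 10, 500 GB total capacity.
printf '0 100\n10 200\n' | days_until_full 500   # prints 30
```

A straight line through two samples is the simplest possible model. Real usage is often bursty, so treat the result as a prompt for closer investigation rather than a forecast.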

1.5 Semi-annual tasks

  • Perform upgrades: OpenStack releases new versions on a six-month cycle. In general, SUSE OpenStack Cloud will release new major versions annually with minor versions and maintenance updates more often. Each new release consists of both new functionality and services, as well as bug fixes for existing functionality.

Note

If you are planning to upgrade, this is also an excellent time to evaluate your existing capabilities, especially in terms of capacity (see Capacity Planning above).

1.6 Troubleshooting

As part of managing your cloud, you should be ready to troubleshoot issues, as needed. The following are some common troubleshooting scenarios and solutions:

How do I determine if my cloud is operating correctly now?: SUSE OpenStack Cloud provides a monitoring solution based on OpenStack’s monasca service. This service provides monitoring and metrics for all OpenStack components, as well as much of the underlying system. By default, SUSE OpenStack Cloud comes with a set of alarms that provide coverage of the primary systems. In addition, you can define alarms based on threshold values for any metrics defined in the system. You can view alarm information in the Operations Console. You can also receive or deliver this information to others by configuring email or other mechanisms. Alarms provide information about whether a component failed and is affecting the system, and also what condition triggered the alarm.

How do I troubleshoot and resolve performance issues for my cloud?: There are a variety of factors that can affect the performance of a cloud system, such as the following:

  • Health of the control plane

  • Health of the hosting compute node and virtualization layer

  • Resource allocation on the compute node

If your cloud users are experiencing performance issues on your cloud, use the following approach:

  1. View the compute summary page on the Operations Console to determine if any alarms have been triggered.

  2. Determine the hosting node of the virtual machine that is having issues.

  3. On the compute hosts page, view the status and resource utilization of the compute node to determine if it has errors or is over-allocated.

  4. On the compute instances page you can view the status of the VM along with its metrics.

How do I troubleshoot and resolve availability issues for my cloud?: If your cloud users are experiencing availability issues, determine what they are experiencing that indicates to them the cloud is down. For example, are they unable to access the Dashboard service (horizon) console or the APIs, or are they having trouble accessing resources? Console or API issues indicate a problem with the control plane. Use the Operations Console to view the status of services to see if there is an issue. However, if the issue is with accessing a virtual machine, also search the consolidated logs that are available in the ELK stack for errors related to the virtual machine and supporting networking.

1.7 Common Questions

What skills do my cloud administrators need?

Your administrators should be experienced Linux admins. They should have experience in application management, as well as experience with Ansible. It is a plus if they have Bash shell scripting and Python programming skills.

In addition, you will need skilled networking engineering staff to administer the cloud network environment.

2 Tutorials

This section contains tutorials for common tasks for your SUSE OpenStack Cloud 9 cloud.

2.1 SUSE OpenStack Cloud Quickstart Guide

2.1.1 Introduction

This document provides simplified instructions for installing and setting up a SUSE OpenStack Cloud. Use this quickstart guide to build testing, demonstration, and lab-type environments, rather than production installations. When you complete this quickstart process, you will have a fully functioning SUSE OpenStack Cloud demo environment.

Note

These simplified instructions are intended for testing or demonstration. Instructions for production installations are in Book “Deployment Guide using Cloud Lifecycle Manager”.

2.1.2 Overview of components

The following are short descriptions of the components that SUSE OpenStack Cloud employs when installing and deploying your cloud.

Ansible.  Ansible is a powerful configuration management tool used by SUSE OpenStack Cloud to manage nearly all aspects of your cloud infrastructure. Most commands in this quickstart guide execute Ansible scripts, known as playbooks. You will run playbooks that install packages, edit configuration files, manage network settings, and take care of the general administration tasks required to get your cloud up and running.

Get more information on Ansible at https://www.ansible.com/.

Cobbler.  Cobbler is another third-party tool used by SUSE OpenStack Cloud to deploy operating systems across the physical servers that make up your cloud. Find more info at http://cobbler.github.io/.

Git.  Git is the version control system used to manage the configuration files that define your cloud. Any changes made to your cloud configuration files must be committed to the locally hosted git repository to take effect. Read more information on Git at https://git-scm.com/.

2.1.3 Preparation

Successfully deploying a SUSE OpenStack Cloud environment is a large endeavor, but it is not complicated. For a successful deployment, you must put a number of components in place before rolling out your cloud. Most importantly, a basic SUSE OpenStack Cloud requires the proper network infrastructure. Because SUSE OpenStack Cloud segregates the network traffic of many of its elements, if the necessary networks, routes, and firewall access rules are not in place, communication required for a successful deployment will not occur.

2.1.4 Getting Started

When your network infrastructure is in place, go ahead and set up the Cloud Lifecycle Manager. This is the server that will orchestrate the deployment of the rest of your cloud. It is also the server you will run most of your deployment and management commands on.

Set up the Cloud Lifecycle Manager

  1. Download the installation media

    Obtain a copy of the SUSE OpenStack Cloud installation media, and make sure that it is accessible by the server that you are installing it on. Your method of doing this may vary. For instance, some may choose to load the installation ISO on a USB drive and physically attach it to the server, while others may run the IPMI Remote Console and attach the ISO to a virtual disc drive.

  2. Install the operating system

    1. Boot your server, using the installation media as the boot source.

    2. Choose "install" from the list of options and choose your preferred keyboard layout, location, language, and other settings.

    3. Set the address, netmask, and gateway for the primary network interface.

    4. Create a root user account.

    Proceed with the OS installation. After the installation is complete and the server has rebooted into the new OS, log in with the user account you created.

  3. Configure the new server

    1. SSH to your new server, and set a valid DNS nameserver in the /etc/resolv.conf file.

    2. Set the environment variable LC_ALL:

      export LC_ALL=C

    You now have a server running SUSE Linux Enterprise Server (SLES). The next step is to configure this machine as a Cloud Lifecycle Manager.

  4. Configure the Cloud Lifecycle Manager

    The installation media you used to install the OS on the server also has the files that will configure your cloud. You need to mount this installation media on your new server in order to use these files.

    1. Using the URL that you obtained the SUSE OpenStack Cloud installation media from, run wget to download the ISO file to your server:

      wget INSTALLATION_ISO_URL
    2. Now mount the ISO in the /media/cdrom/ directory:

      sudo mount INSTALLATION_ISO /media/cdrom/
    3. Unpack the tar file found in the /media/cdrom/ardana/ directory where you just mounted the ISO:

      tar xvf /media/cdrom/ardana/ardana-x.x.x-x.tar
    4. Now you will install and configure all the components needed to turn this server into a Cloud Lifecycle Manager. Run the ardana-init.bash script from the uncompressed tar file:

      ~/ardana-x.x.x/ardana-init.bash

      The ardana-init.bash script prompts you for an optional SSH passphrase, which protects the RSA key used to SSH to the other cloud nodes. To skip the passphrase, press Enter at the prompt.

      The ardana-init.bash script automatically installs and configures everything needed to set up this server as the lifecycle manager for your cloud.

      When the script has finished running, you can proceed to the next step, editing your input files.

  5. Edit your input files

    Your SUSE OpenStack Cloud input files are where you define your cloud infrastructure and how it runs. The input files define options such as which servers are included in your cloud, the type of disks the servers use, and their network configuration. The input files also define which services your cloud will provide and use, the network architecture, and the storage backends for your cloud.

    There are several example configurations, which you can find on your Cloud Lifecycle Manager in the ~/openstack/examples/ directory.

    1. The simplest way to set up your cloud is to copy the contents of one of these example configurations to your ~/openstack/my_cloud/definition/ directory. You can then edit the copied files and define your cloud.

      cp -r ~/openstack/examples/CHOSEN_EXAMPLE/* ~/openstack/my_cloud/definition/
    2. Edit the files in your ~/openstack/my_cloud/definition/ directory to define your cloud.

  6. Commit your changes

    When you finish editing the necessary input files, stage them, and then commit the changes to the local Git repository:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My commit message"
  7. Image your servers

    Now that you have finished editing your input files, you can deploy the configuration to the servers that will comprise your cloud.

    1. Image the servers. You will install the SLES operating system across all the servers in your cloud, using Ansible playbooks to trigger the process.

    2. The following playbook confirms that your servers are accessible over their IPMI ports, which is a prerequisite for the imaging process:

      ansible-playbook -i hosts/localhost bm-power-status.yml
    3. Now validate that your cloud configuration files have proper YAML syntax by running the config-processor-run.yml playbook:

      ansible-playbook -i hosts/localhost config-processor-run.yml

      If you receive an error when running the preceding playbook, one or more of your configuration files has an issue. Refer to the output of the Ansible playbook, and look for clues in the Ansible log file, found at ~/.ansible/ansible.log.

    4. The next step is to prepare your imaging system, Cobbler, to deploy operating systems to all your cloud nodes:

      ansible-playbook -i hosts/localhost cobbler-deploy.yml
    5. Now you can image your cloud nodes. You will use an Ansible playbook to trigger Cobbler to deploy operating systems to all the nodes you specified in your input files:

      ansible-playbook -i hosts/localhost bm-reimage.yml

      The bm-reimage.yml playbook performs the following operations:

      1. Powers down the servers.

      2. Sets the servers to boot from a network interface.

      3. Powers on the servers and performs a PXE OS installation.

      4. Waits for the servers to power themselves down as part of a successful OS installation. This can take some time.

      5. Sets the servers to boot from their local hard disks and powers on the servers.

      6. Waits for the SSH service to start on the servers and verifies that they have the expected host-key signature.

  8. Deploy your cloud

    Now that your servers are running the SLES operating system, it is time to configure them for the roles they will play in your new cloud.

    1. Prepare the Cloud Lifecycle Manager to deploy your cloud configuration to all the nodes:

      ansible-playbook -i hosts/localhost ready-deployment.yml

      NOTE: The preceding playbook creates a new directory, ~/scratch/ansible/next/ardana/ansible/, from which you will run many of the following commands.

    2. (Optional) If you are reusing servers or disks to run your cloud, you can wipe the disks of your newly imaged servers by running the wipe_disks.yml playbook:

      cd ~/scratch/ansible/next/ardana/ansible/
      ansible-playbook -i hosts/verb_hosts wipe_disks.yml

      The wipe_disks.yml playbook removes any existing data from the drives on your new servers. This can be helpful if you are reusing servers or disks. This action will not affect the OS partitions on the servers.

      Note

      The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions. For example, if site.yml fails, you cannot start fresh by running wipe_disks.yml. You must first reimage the node with bm-reimage.yml and then run wipe_disks.yml.

    3. Now it is time to deploy your cloud. Do this by running the site.yml playbook, which pushes the configuration you defined in the input files out to all the servers that will host your cloud.

      cd ~/scratch/ansible/next/ardana/ansible/
      ansible-playbook -i hosts/verb_hosts site.yml

      The site.yml playbook installs packages, starts services, configures network interface settings, sets iptables firewall rules, and more. Upon successful completion of this playbook, your SUSE OpenStack Cloud will be in place and in a running state. This playbook can take up to six hours to complete.

  9. SSH to your nodes

    Now that you have successfully run site.yml, your cloud will be up and running. You can verify connectivity to your nodes by connecting to each one by using SSH. You can find the IP addresses of your nodes by viewing the /etc/hosts file.

    For security reasons, you can only SSH to your nodes from the Cloud Lifecycle Manager. SSH connections from any machine other than the Cloud Lifecycle Manager will be refused by the nodes.

    From the Cloud Lifecycle Manager, SSH to your nodes:

    ssh <management IP address of node>

    Also note that SSH is limited to your cloud's management network. Each node has an address on the management network, and you can find this address by reading the /etc/hosts or server_info.yml file.
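
To check connectivity to many nodes at once, you could extract the management entries from the hosts file and loop over them. The sketch below is illustrative only: the helper name list_mgmt_hosts is hypothetical, and it assumes that management host names end in -mgmt, which may not hold for every input model.

```shell
# Print "<ip> <hostname>" for each hosts-file entry whose name ends in -mgmt.
# $1: path to a hosts file (defaults to /etc/hosts).
list_mgmt_hosts() {
    awk '
        /^[[:space:]]*#/ { next }            # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i ~ /-mgmt$/) print $1, $i
            }
        }' "${1:-/etc/hosts}"
}

# Possible usage from the Cloud Lifecycle Manager (not run here):
#   list_mgmt_hosts | while read -r ip name; do
#       ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$ip" true \
#           && echo "ok: $name" || echo "unreachable: $name"
#   done
```

The -n option keeps ssh from consuming the loop's input, and BatchMode makes a node that would prompt for a password fail quickly instead of blocking the loop.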

2.2 Installing the Command-Line Clients

During the installation, by default, the suite of OpenStack command-line tools is installed on the Cloud Lifecycle Manager and the control plane in your environment. You can learn more about these tools in the OpenStack documentation for OpenStackClient.

If you wish to install the command-line interfaces on other nodes in your environment, you can use either of the two methods described below.

2.2.1 Installing the CLI tools using the input model

During the initial install phase of your cloud you can edit your input model to request that the command-line clients be installed on any of the node clusters in your environment. To do so, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit your control_plane.yml file. Full path:

    ~/openstack/my_cloud/definition/data/control_plane.yml
  3. In this file you will see a list of service-components to be installed on each of your clusters. These clusters will be divided per role, with your controller node cluster likely coming at the beginning. Here you will see a list of each of the clients that can be installed. These include:

    keystone-client
    glance-client
    cinder-client
    nova-client
    neutron-client
    swift-client
    heat-client
    openstack-client
    monasca-client
    barbican-client
    designate-client
  4. For each client you want to install, specify the name under the service-components section for the cluster you want to install it on.

    So, for example, if you would like to install the nova and neutron clients on your Compute node cluster, you can do so by adding the nova-client and neutron-client services, like this:

          resources:
            - name: compute
              resource-prefix: comp
              server-role: COMPUTE-ROLE
              allocation-policy: any
              min-count: 0
              service-components:
                - ntp-client
                - nova-compute
                - nova-compute-kvm
                - neutron-l3-agent
                - neutron-metadata-agent
                - neutron-openvswitch-agent
                - nova-client
                - neutron-client
    Note

    This example uses the entry-scale-kvm sample file. Your model may be different so use this as a guide but do not copy and paste the contents of this example into your input model.

  5. Commit your configuration to the local git repo, as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  6. Continue with the rest of your installation.
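
Before committing in step 5, a quick text search can confirm that a client was actually added to the file. This is only a sketch: lists_component is a hypothetical helper, and it matches list entries anywhere in the file rather than within a specific cluster, so it does not replace validation of the input model.

```shell
# Succeed if FILE contains a YAML list entry of the form "- COMPONENT".
# $1: path to the YAML file; $2: component name, for example nova-client.
# The name is used inside a regular expression, so it should contain only
# letters, digits, and hyphens.
lists_component() {
    grep -Eq "^[[:space:]]*-[[:space:]]+$2[[:space:]]*\$" "$1"
}

# Possible usage (path taken from step 2; not run here):
#   lists_component ~/openstack/my_cloud/definition/data/control_plane.yml nova-client \
#       && echo "nova-client is listed"
```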

2.2.2 Installing the CLI tools using Ansible

At any point after your initial installation you can install the command-line clients on any of the nodes in your environment. To do so, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the hostname for the nodes you want to install the clients on by looking in your hosts file:

    cat /etc/hosts
  3. Install the clients using this playbook, specifying your hostnames using commas:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts -e "install_package=<client_name>" client-deploy.yml -e "install_hosts=<hostname>"

    So, for example, if you would like to install the nova client (novaclient) on two of your Compute nodes with hostnames ardana-cp1-comp0001-mgmt and ardana-cp1-comp0002-mgmt, you can use this syntax:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts -e "install_package=novaclient" client-deploy.yml -e "install_hosts=ardana-cp1-comp0001-mgmt,ardana-cp1-comp0002-mgmt"
  4. Once the playbook completes successfully, you should be able to SSH to those nodes and, using the proper credentials, authenticate and use the command-line interfaces you have installed.
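
When installing on many nodes, typing the comma-separated install_hosts value by hand is error-prone. The following sketch builds it from a hosts file. The helper name install_hosts_list and the example pattern are hypothetical; adjust the pattern to the host names in your environment.

```shell
# Build a comma-separated list of host names that match a pattern.
# $1: extended regular expression for host names
# $2: path to a hosts file (defaults to /etc/hosts)
install_hosts_list() {
    awk -v pat="$1" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                if ($i ~ pat) print $i
            }
        }' "${2:-/etc/hosts}" | sort -u | paste -sd, -
}

# Possible usage with the playbook from step 3 (not run here):
#   hosts=$(install_hosts_list 'comp[0-9]+-mgmt$')
#   ansible-playbook -i hosts/verb_hosts -e "install_package=novaclient" \
#       client-deploy.yml -e "install_hosts=$hosts"
```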

2.3 Cloud Admin Actions with the Command Line

Cloud admins can use the command-line tools to perform domain admin tasks such as user and project administration.

2.3.1 Creating Additional Cloud Admins

You can create additional Cloud Admins to help with the administration of your cloud.

You can perform keystone identity service query and administration tasks using the OpenStack command-line utility, which is installed on the Cloud Lifecycle Manager.

Note

keystone administration tasks should be performed by an admin user with a token scoped to the default domain via the keystone v3 identity API. These settings are preconfigured in the file ~/keystone.osrc. By default, keystone.osrc is configured with the admin endpoint of keystone. If the admin endpoint is not accessible from your network, change OS_AUTH_URL to point to the public endpoint.

2.3.2 Command Line Examples

For a full list of OpenStackClient commands, see OpenStackClient Command List.

Sourcing the keystone Administration Credentials

You can set the environment variables needed for identity administration by sourcing the keystone.osrc file created by the Cloud Lifecycle Manager:

source ~/keystone.osrc
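
After sourcing the file, you may want to confirm that the expected variables are set before running commands. The sketch below is generic: check_env_vars is a hypothetical helper, and apart from OS_AUTH_URL, which is mentioned in the note in Section 2.3.1, the variable names in the usage comment are standard OpenStackClient variables that are assumed, not confirmed, to be set by keystone.osrc.

```shell
# Report which of the named environment variables are unset or empty.
# Returns 0 only if every named variable has a value.
# Arguments must be plain variable names, because they are passed to eval.
check_env_vars() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"     # indirect lookup of the variable
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage (variable list is an assumption; not run here):
#   source ~/keystone.osrc
#   check_env_vars OS_AUTH_URL OS_USERNAME OS_PASSWORD OS_PROJECT_NAME \
#       || echo "credentials are incomplete"
```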

List users in the default domain

These users are created by the Cloud Lifecycle Manager in the MySQL back end:

openstack user list

Example output:

$ openstack user list
+----------------------------------+------------------+
| ID                               | Name             |
+----------------------------------+------------------+
| 155b68eda9634725a1d32c5025b91919 | heat             |
| 303375d5e44d48f298685db7e6a4efce | octavia          |
| 40099e245a394e7f8bb2aa91243168ee | logging          |
| 452596adbf4d49a28cb3768d20a56e38 | admin            |
| 76971c3ad2274820ad5347d46d7560ec | designate        |
| 7b2dc0b5bb8e4ffb92fc338f3fa02bf3 | hlm_backup       |
| 86d345c960e34c9189519548fe13a594 | barbican         |
| 8e7027ab438c4920b5853d52f1e08a22 | nova_monasca     |
| 9c57dfff57e2400190ab04955e7d82a0 | barbican_service |
| a3f99bcc71b242a1bf79dbc9024eec77 | nova             |
| aeeb56fc4c4f40e0a6a938761f7b154a | glance-check     |
| af1ef292a8bb46d9a1167db4da48ac65 | cinder           |
| af3000158c6d4d3d9257462c9cc68dda | demo             |
| b41a7d0cb1264d949614dc66f6449870 | swift            |
| b78a2b17336b43368fb15fea5ed089e9 | cinderinternal   |
| bae1718dee2d47e6a75cd6196fb940bd | monasca          |
| d4b9b32f660943668c9f5963f1ff43f9 | ceilometer       |
| d7bef811fb7e4d8282f19fb3ee5089e9 | swift-monitor    |
| e22bbb2be91342fd9afa20baad4cd490 | neutron          |
| ec0ad2418a644e6b995d8af3eb5ff195 | glance           |
| ef16c37ec7a648338eaf53c029d6e904 | swift-dispersion |
| ef1a6daccb6f4694a27a1c41cc5e7a31 | glance-swift     |
| fed3a599b0864f5b80420c9e387b4901 | monasca-agent    |
+----------------------------------+------------------+
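
For scripting, the table output is awkward to parse. OpenStackClient's standard output options -f value and -c <column> print bare values instead. As a sketch, the hypothetical helper name_in_list below checks whether an exact name appears in such output, for example before creating a user.

```shell
# Succeed if the exact name $1 appears as a whole line on stdin.
# -F: fixed string, -x: match the whole line, -q: no output.
name_in_list() {
    grep -Fxq -- "$1"
}

# Possible usage (requires sourced credentials; not run here):
#   if openstack user list -f value -c Name | name_in_list testuser; then
#       echo "testuser already exists"
#   fi
```

Matching whole lines avoids false positives such as "swift" matching "swift-monitor".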

List domains created by the installation process:

openstack domain list

Example output:

$ openstack domain list
+----------------------------------+---------+---------+----------------------------------------------------------------------+
| ID                               | Name    | Enabled | Description                                                          |
+----------------------------------+---------+---------+----------------------------------------------------------------------+
| 6740dbf7465a4108a36d6476fc967dbd | heat    | True    | Owns users and projects created by heat                              |
| default                          | Default | True    | Owns users and tenants (i.e. projects) available on Identity API v2. |
+----------------------------------+---------+---------+----------------------------------------------------------------------+

List the roles:

openstack role list

Example output:

$ openstack role list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 0be3da26cd3f4cd38d490b4f1a8b0c03 | designate_admin           |
| 13ce16e4e714473285824df8188ee7c0 | monasca-agent             |
| 160f25204add485890bc95a6065b9954 | key-manager:service-admin |
| 27755430b38c411c9ef07f1b78b5ebd7 | monitor                   |
| 2b8eb0a261344fbb8b6b3d5934745fe1 | key-manager:observer      |
| 345f1ec5ab3b4206a7bffdeb5318bd32 | admin                     |
| 49ba3b42696841cea5da8398d0a5d68e | nova_admin                |
| 5129400d4f934d4fbfc2c3dd608b41d9 | ResellerAdmin             |
| 60bc2c44f8c7460a9786232a444b56a5 | neutron_admin             |
| 654bf409c3c94aab8f929e9e82048612 | cinder_admin              |
| 854e542baa144240bfc761cdb5fe0c07 | monitoring-delegate       |
| 8946dbdfa3d346b2aa36fa5941b43643 | key-manager:auditor       |
| 901453d9a4934610ad0d56434d9276b4 | key-manager:admin         |
| 9bc90d1121544e60a39adbfe624a46bc | monasca-user              |
| 9fe2a84a3e7443ae868d1009d6ab4521 | service                   |
| 9fe2ff9ee4384b1894a90878d3e92bab | member                    |
| a24d4e0a5de14bffbe166bfd68b36e6a | swiftoperator             |
| ae088fcbf579425580ee4593bfa680e5 | heat_stack_user           |
| bfba56b2562942e5a2e09b7ed939f01b | keystoneAdmin             |
| c05f54cf4bb34c7cb3a4b2b46c2a448b | glance_admin              |
| fe010be5c57240db8f559e0114a380c1 | key-manager:creator       |
+----------------------------------+---------------------------+

List the admin user's role assignments within the default domain:

openstack role assignment list --user admin --domain default

Example output:

# This indicates that the admin user is assigned the admin role within the default domain
ardana >  openstack role assignment list --user admin --domain default
+----------------------------------+----------------------------------+-------+---------+---------+
| Role                             | User                             | Group | Project | Domain  |
+----------------------------------+----------------------------------+-------+---------+---------+
| b398322103504546a070d607d02618ad | fed1c038d9e64392890b6b44c38f5bbb |       |         | default |
+----------------------------------+----------------------------------+-------+---------+---------+

Create a new user in the default domain:

openstack user create --domain default --password-prompt --email <email_address> --description <description> --enable <username>

Example output showing the creation of a user named testuser with email address test@example.com and a description of Test User:

ardana >  openstack user create --domain default --password-prompt --email test@example.com --description "Test User" --enable testuser
User Password:
Repeat User Password:
+-------------+----------------------------------+
| Field       | Value                            |
+-------------+----------------------------------+
| description | Test User                        |
| domain_id   | default                          |
| email       | test@example.com                 |
| enabled     | True                             |
| id          | 8aad69acacf0457e9690abf8c557754b |
| name        | testuser                         |
+-------------+----------------------------------+

Assign the admin role to testuser within the default domain:

openstack role add admin --user <username> --domain default
openstack role assignment list --user <username> --domain default

Example output:

# Just for demonstration purposes - do not do this in a production environment!
ardana >  openstack role add admin --user testuser --domain default
ardana >  openstack role assignment list --user testuser --domain default
+----------------------------------+----------------------------------+-------+---------+---------+
| Role                             | User                             | Group | Project | Domain  |
+----------------------------------+----------------------------------+-------+---------+---------+
| b398322103504546a070d607d02618ad | 8aad69acacf0457e9690abf8c557754b |       |         | default |
+----------------------------------+----------------------------------+-------+---------+---------+

2.3.3 Assigning the default service admin roles

The following examples illustrate how you can assign each of the new service admin roles to a user.

Assigning the glance_admin role

Only a user with the admin role can assign the glance_admin role. To assign it, first set the environment variables for the identity service administrator.

  1. First, source the identity service credentials:

    source ~/keystone.osrc
  2. You can add the glance_admin role to a user on a project with this command:

    openstack role add --user <username> --project <project_name> glance_admin

    Example, showing a user named testuser being granted the glance_admin role in the test_project project:

    openstack role add --user testuser --project test_project glance_admin
  3. You can confirm the role assignment by listing out the roles:

    openstack role assignment list --user <username>

    Example output:

    ardana >  openstack role assignment list --user testuser
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
    | Role                             | User                             | Group | Project                          | Domain | Inherited |
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
    | 46ba80078bc64853b051c964db918816 | 8bcfe10101964e0c8ebc4de391f3e345 |       | 0ebbf7640d7948d2a17ac08bbbf0ca5b |        | False     |
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
  4. Note that only the role ID is displayed. To get the role name, execute the following:

    openstack role show <role_id>

    Example output:

    ardana >  openstack role show 46ba80078bc64853b051c964db918816
    +-------+----------------------------------+
    | Field | Value                            |
    +-------+----------------------------------+
    | id    | 46ba80078bc64853b051c964db918816 |
    | name  | glance_admin                     |
    +-------+----------------------------------+
  5. To demonstrate that the user has glance admin privileges, authenticate with that user's credentials and then upload and publish an image. Only a user with the admin or glance_admin role can publish an image.

    1. The easiest way to do this will be to make a copy of the service.osrc file and edit it with your user credentials. You can do that with this command:

      cp ~/service.osrc ~/user.osrc
    2. Using your preferred editor, edit the user.osrc file and replace the values for the following entries to match your user credentials:

      export OS_USERNAME=<username>
      export OS_PASSWORD=<password>
    3. You will also need to edit the following lines for your environment:

      ## Change these values from 'unset' to 'export'
      export OS_PROJECT_NAME=<project_name>
      export OS_PROJECT_DOMAIN_NAME=Default

      Here is an example of the edited file:

      unset OS_DOMAIN_NAME
      export OS_IDENTITY_API_VERSION=3
      export OS_AUTH_VERSION=3
      export OS_PROJECT_NAME=test_project
      export OS_PROJECT_DOMAIN_NAME=Default
      export OS_USERNAME=testuser
      export OS_USER_DOMAIN_NAME=Default
      export OS_PASSWORD=testuser
      export OS_AUTH_URL=http://192.168.245.9:35357/v3
      export OS_ENDPOINT_TYPE=internalURL
      # OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
      export OS_INTERFACE=internal
      export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
  6. Source the environment variables for your user:

    source ~/user.osrc
  7. Upload an image and make it public:

    openstack image create --public --container-format bare --disk-format qcow2 --file uploadme.txt "upload me"

    Example output:

    +------------------+--------------------------------------+
    | Property         | Value                                |
    +------------------+--------------------------------------+
    | checksum         | dd75c3b840a16570088ef12f6415dd15     |
    | container_format | bare                                 |
    | created_at       | 2016-01-06T23:31:27Z                 |
    | disk_format      | qcow2                                |
    | id               | cf1490f4-1eb1-477c-92e8-15ebbe91da03 |
    | min_disk         | 0                                    |
    | min_ram          | 0                                    |
    | name             | upload me                            |
    | owner            | bd24897932074780a20b780c4dde34c7     |
    | protected        | False                                |
    | size             | 10                                   |
    | status           | active                               |
    | tags             | []                                   |
    | updated_at       | 2016-01-06T23:31:31Z                 |
    | virtual_size     | None                                 |
    | visibility       | public                               |
    +------------------+--------------------------------------+
    Note

    You can use the command openstack help image create to get the full syntax for this command.
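
Steps 3 and 4 above resolve a role ID to its name by hand. The following Python sketch shows one way to do that lookup for many assignments at once. It assumes you have captured the output of openstack role assignment list and openstack role list with the client's JSON formatter (-f json), and that the JSON keys match the table headers shown above. The function name and the sample data are illustrative only.

```python
import json


def name_role_assignments(assignments_json, roles_json):
    """Return assignment rows with each role ID replaced by the role name."""
    # Build an ID -> name lookup from the role list.
    roles = {role["ID"]: role["Name"] for role in json.loads(roles_json)}
    named = []
    for row in json.loads(assignments_json):
        row = dict(row)
        # Fall back to the raw ID when the role is not in the lookup table.
        row["Role"] = roles.get(row["Role"], row["Role"])
        named.append(row)
    return named


# Sample data shaped like the tables above (assumed JSON layout).
SAMPLE_ROLES = json.dumps([
    {"ID": "46ba80078bc64853b051c964db918816", "Name": "glance_admin"},
])
SAMPLE_ASSIGNMENTS = json.dumps([
    {"Role": "46ba80078bc64853b051c964db918816",
     "User": "8bcfe10101964e0c8ebc4de391f3e345",
     "Group": "",
     "Project": "0ebbf7640d7948d2a17ac08bbbf0ca5b",
     "Domain": "",
     "Inherited": False},
])

for row in name_role_assignments(SAMPLE_ASSIGNMENTS, SAMPLE_ROLES):
    print(row["User"], row["Role"], row["Project"])
```

The lookup keeps unknown role IDs unchanged, so a stale or partial role list does not hide an assignment.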

Assigning the nova_admin role

Only a user with the admin role can assign the nova_admin role. To assign it, first set the environment variables for the identity service administrator.

  1. First, source the identity service credentials:

    source ~/keystone.osrc
  2. You can add the nova_admin role to a user on a project with this command:

    openstack role add --user <username> --project <project_name> nova_admin

    Example, showing a user named testuser being granted the nova_admin role in the test_project project:

    openstack role add --user testuser --project test_project nova_admin
  3. You can confirm the role assignment by listing out the roles:

    openstack role assignment list --user <username>

    Example output:

    ardana >  openstack role assignment list --user testuser
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
    | Role                             | User                             | Group | Project                          | Domain | Inherited |
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
    | 8cdb02bab38347f3b65753099f3ab73c | 8bcfe10101964e0c8ebc4de391f3e345 |       | 0ebbf7640d7948d2a17ac08bbbf0ca5b |        | False     |
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
  4. Note that only the role ID is displayed. To get the role name, execute the following:

    openstack role show <role_id>

    Example output:

    ardana >  openstack role show 8cdb02bab38347f3b65753099f3ab73c
    +-------+----------------------------------+
    | Field | Value                            |
    +-------+----------------------------------+
    | id    | 8cdb02bab38347f3b65753099f3ab73c |
    | name  | nova_admin                       |
    +-------+----------------------------------+
  5. To demonstrate that the user has nova admin privileges, authenticate with that user's credentials and then list and delete virtual machines that belong to other projects, as shown in the following steps.

    1. The easiest way to do this will be to make a copy of the service.osrc file and edit it with your user credentials. You can do that with this command:

      cp ~/service.osrc ~/user.osrc
    2. Using your preferred editor, edit the user.osrc file and replace the values for the following entries to match your user credentials:

      export OS_USERNAME=<username>
      export OS_PASSWORD=<password>
    3. You will also need to edit the following lines for your environment:

      ## Change these values from 'unset' to 'export'
      export OS_PROJECT_NAME=<project_name>
      export OS_PROJECT_DOMAIN_NAME=Default

      Here is an example of the edited file:

      unset OS_DOMAIN_NAME
      export OS_IDENTITY_API_VERSION=3
      export OS_AUTH_VERSION=3
      export OS_PROJECT_NAME=test_project
      export OS_PROJECT_DOMAIN_NAME=Default
      export OS_USERNAME=testuser
      export OS_USER_DOMAIN_NAME=Default
      export OS_PASSWORD=testuser
      export OS_AUTH_URL=http://192.168.245.9:35357/v3
      export OS_ENDPOINT_TYPE=internalURL
      # OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
      export OS_INTERFACE=internal
      export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
  6. Source the environment variables for your user:

    source ~/user.osrc
  7. List all of the virtual machines in the project specified in user.osrc:

    openstack server list

    Example output showing no virtual machines, because there are no virtual machines created on the project specified in the user.osrc file:

    +--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------------------------------+
    | ID                                   | Name                                                  | Status | Networks                                                        |
    +--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------------------------------+
    +--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------------------------------+
  8. In this demonstration, a virtual machine exists in a different project. Because your user has nova_admin permissions, you can view virtual machines in all projects by using a slightly different command:

    openstack server list --all-projects

    Example output, now showing a virtual machine:

    ardana >  openstack server list --all-projects
    +--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------------------------------+
    | ID                                   | Name                                                  | Status | Networks                                                        |
    +--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------------------------------+
    | da4f46e2-4432-411b-82f7-71ab546f91f3 | testvml                                               | ACTIVE |                                                                 |
    +--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------------------------------+
  9. You can also now delete virtual machines in other projects by using the --all-projects switch:

    openstack server delete --all-projects <instance_id>

    Example, showing us deleting the instance in the previous step:

    openstack server delete --all-projects da4f46e2-4432-411b-82f7-71ab546f91f3
  10. You can get a full list of available commands by using this:

    openstack -h

You can perform the same steps as above for the neutron and cinder service admin roles:

neutron_admin
cinder_admin
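
Each walkthrough above copies service.osrc to user.osrc and edits it by hand. As an illustration of that edit, the following Python sketch rewrites the text of such a file: every variable named in overrides is emitted as an export line, whether the template had it as unset or export. The function name and the template contents are assumptions made for this example, and values containing spaces would still need shell quoting.

```python
def customize_osrc(template_text, overrides):
    """Return osrc text in which each overridden variable is exported."""
    remaining = dict(overrides)
    lines = []
    for line in template_text.splitlines():
        parts = line.split()
        name = None
        # Only 'export NAME=value' and 'unset NAME' lines are candidates.
        if len(parts) >= 2 and parts[0] in ("export", "unset"):
            name = parts[1].split("=", 1)[0]
        if name in remaining:
            lines.append("export {}={}".format(name, remaining.pop(name)))
        else:
            lines.append(line)
    # Variables that were absent from the template are appended at the end.
    for name, value in remaining.items():
        lines.append("export {}={}".format(name, value))
    return "\n".join(lines) + "\n"


# Hypothetical template; a real service.osrc contains more entries.
TEMPLATE = """unset OS_PROJECT_NAME
unset OS_PROJECT_DOMAIN_NAME
export OS_USERNAME=service_user
export OS_PASSWORD=service_password
export OS_INTERFACE=internal
"""

print(customize_osrc(TEMPLATE, {
    "OS_USERNAME": "testuser",
    "OS_PASSWORD": "<password>",
    "OS_PROJECT_NAME": "test_project",
    "OS_PROJECT_DOMAIN_NAME": "Default",
}))
```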

2.3.4 Customize policy.json on the Cloud Lifecycle Manager

Previously, deploying a customized policy.json for a service meant going to each target node and making the changes there. This is no longer necessary: policy.json files can be edited on the Cloud Lifecycle Manager and then deployed to the nodes. Exercise caution when modifying policy.json files. Validate the changes in a non-production environment before rolling them out into production, and do not make policy.json changes without a way to validate the desired policy behavior. Updated policy.json files can be deployed using the appropriate <service_name>-reconfigure.yml playbook.

2.3.5 Roles

Service roles represent the functionality used to implement the OpenStack role based access control (RBAC) model. This is used to manage access to each OpenStack service. Roles are named and assigned per user or group for each project by the identity service. Role definition and policy enforcement are defined outside of the identity service independently by each OpenStack service.

The token generated by the identity service for each user authentication contains the role(s) assigned to that user for a particular project. When a user attempts to access a specific OpenStack service, the role is parsed by the service, compared to the service-specific policy file, and then granted the resource access defined for that role by the service policy file.

Each service has its own service policy file with the /etc/[SERVICE_CODENAME]/policy.json file name format where [SERVICE_CODENAME] represents a specific OpenStack service name. For example, the OpenStack nova service would have a policy file called /etc/nova/policy.json.
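
To make the role check described above concrete, the following Python sketch evaluates a deliberately simplified rule against the roles carried in a token. It only understands role:<name> checks joined by "or"; real policy files support a richer rule language, and this is not the code the services use. The rule name and rule text in POLICY are illustrative assumptions.

```python
def rule_allows(rule, token_roles):
    """Evaluate a rule made only of 'role:<name>' checks joined by 'or'."""
    roles = set(token_roles)
    for check in rule.split(" or "):
        kind, _, value = check.strip().partition(":")
        # Grant access as soon as one role check matches a token role.
        if kind == "role" and value in roles:
            return True
    return False


# Illustrative policy entry, shaped like a line in a policy.json file.
POLICY = {"publicize_image": "role:admin or role:glance_admin"}

print(rule_allows(POLICY["publicize_image"], ["member", "glance_admin"]))
print(rule_allows(POLICY["publicize_image"], ["member"]))
```

A user whose token carries only the member role fails the check, which is the behavior you would want to confirm in a non-production environment after changing a policy file.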

Service policy files can be modified and deployed to control nodes from the Cloud Lifecycle Manager. Administrators are advised to validate policy changes before checking them in to the site branch of the local git repository and before rolling them into production. Do not make changes to policy files without having a way to validate them.

The policy files are located at the following site branch directory on the Cloud Lifecycle Manager.

~/openstack/ardana/ansible/roles/

For test and validation, policy files can be modified in a non-production environment from the ~/scratch/ directory. For a specific policy file, run a search for policy.json. To deploy policy changes for a service, run the service specific reconfiguration playbook (for example, nova-reconfigure.yml). For a complete list of reconfiguration playbooks, change directories to ~/scratch/ansible/next/ardana/ansible and run this command:

ls -l | grep reconfigure
Note

Comments added to any *.j2 files (including templates) must follow proper comment syntax. Otherwise you may see errors when running the config-processor or any of the service playbooks.

2.4 Log Management and Integration

2.4.1 Overview

SUSE OpenStack Cloud uses the ELK (Elasticsearch, Logstash, Kibana) stack for log management across the entire cloud infrastructure. This configuration facilitates simple administration as well as integration with third-party tools. This tutorial covers how to forward your logs to a third-party tool or service, and how to access and search the Elasticsearch log stores through API endpoints.

2.4.2 The ELK stack

The ELK logging stack consists of the Elasticsearch, Logstash, and Kibana elements.

  • Elasticsearch.  Elasticsearch is the storage and indexing component of the ELK stack. It stores and indexes the data received from Logstash. Indexing makes your log data searchable by tools designed for querying and analyzing massive sets of data. You can query the Elasticsearch datasets from the built-in Kibana console, a third-party data analysis tool, or through the Elasticsearch API (covered later).

  • Logstash.  Logstash reads the log data from the services running on your servers, and then aggregates and ships that data to a storage location. By default, Logstash sends the data to the Elasticsearch indexes, but it can also be configured to send data to other storage and indexing tools such as Splunk.

  • Kibana.  Kibana provides a simple and easy-to-use method for searching, analyzing, and visualizing the log data stored in the Elasticsearch indexes. You can customize the Kibana console to provide graphs, charts, and other visualizations of your log data.

2.4.3 Using the Elasticsearch API

You can query the Elasticsearch indexes through various language-specific APIs, as well as directly over the IP address and port that Elasticsearch exposes on your implementation. By default, Elasticsearch listens on localhost, port 9200. You can run queries directly from a terminal using curl. For example:

ardana > curl -XGET 'http://localhost:9200/_search?q=tag:yourSearchTag'

The preceding command searches all indexes for all data with the "yourSearchTag" tag.

You can also use the Elasticsearch API from outside the logging node. This method connects over the Kibana VIP address, port 5601, using basic http authentication. For example, you can use the following command to perform the same search as the preceding search:

curl -u kibana:<password> 'kibana_vip:5601/_search?q=tag:yourSearchTag'

You can further refine your search to a specific index of data, in this case the "elasticsearch" index:

ardana > curl -XGET 'http://localhost:9200/elasticsearch/_search?q=tag:yourSearchTag'
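
If you build these search URLs in a script rather than typing them, the query string should be URL-encoded. The following Python sketch assembles the same two URLs as the curl examples above, using only the standard library. The function name is an assumption made for this example; the base URL, index name, and tag come from the examples.

```python
from urllib.parse import quote, urlencode


def search_url(base, query, index=None):
    """Build an Elasticsearch _search URL, optionally limited to one index."""
    if index is None:
        path = "/_search"
    else:
        path = "/{}/_search".format(quote(index, safe=""))
    # urlencode percent-encodes reserved characters such as ':' in the query.
    return "{}{}?{}".format(base.rstrip("/"), path, urlencode({"q": query}))


print(search_url("http://localhost:9200", "tag:yourSearchTag"))
print(search_url("http://localhost:9200", "tag:yourSearchTag", index="elasticsearch"))
```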

The search API is RESTful and returns its responses in JSON format. Here is a sample (though empty) response:

{
    "took":13,
    "timed_out":false,
    "_shards":{
        "total":45,
        "successful":45,
        "failed":0
    },
    "hits":{
        "total":0,
        "max_score":null,
        "hits":[]
    }
}
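
A script that consumes such a response typically checks that the search completed on all shards before reading the hits. The following Python sketch does this for the response layout shown above. The function name and the keys of the returned summary are assumptions made for this example; it also accepts the object form of hits.total that newer Elasticsearch versions return.

```python
import json


def summarize_search(response_text):
    """Summarize an Elasticsearch search response given as JSON text."""
    body = json.loads(response_text)
    total = body["hits"]["total"]
    # Newer Elasticsearch versions report the total as {"value": N, ...}.
    if isinstance(total, dict):
        total = total["value"]
    return {
        "complete": not body["timed_out"] and body["_shards"]["failed"] == 0,
        "total_hits": total,
        "documents": [hit.get("_source", {}) for hit in body["hits"]["hits"]],
    }


# The sample (empty) response from above.
SAMPLE_RESPONSE = """
{
    "took":13,
    "timed_out":false,
    "_shards":{"total":45, "successful":45, "failed":0},
    "hits":{"total":0, "max_score":null, "hits":[]}
}
"""

print(summarize_search(SAMPLE_RESPONSE))
```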

2.4.4 For More Information

You can find more detailed Elasticsearch API documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html.

Review the Elasticsearch Python API documentation at http://elasticsearch-py.readthedocs.io/en/master/api.html.

Read the Elasticsearch Java API documentation at https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/index.html.

2.4.5 Forwarding your logs

You can configure Logstash to ship your logs to an outside storage and indexing system, such as Splunk. Setting up this configuration is as simple as editing a few configuration files, and then running the Ansible playbooks that implement the changes. Here are the steps.

  1. Begin by logging in to the Cloud Lifecycle Manager.

  2. Verify that the logging system is up and running:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts logging-status.yml

    When the preceding playbook completes without error, proceed to the next step.

  3. Edit the Logstash configuration file, found at the following location:

    ~/openstack/ardana/ansible/roles/logging-server/templates/logstash.conf.j2

    Near the end of the Logstash configuration file, you will find a section for configuring Logstash output destinations. The following example demonstrates the changes necessary to forward your logs to an outside server (the added tcp block and its comment). The configuration block sets up a TCP connection to the destination server's IP address over port 5514.

    # Logstash outputs
        output {
          # Configure Elasticsearch output
          # http://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
          elasticsearch {
            index => "%{[@metadata][es_index]}"
            hosts => ["{{ elasticsearch_http_host }}:{{ elasticsearch_http_port }}"]
            flush_size => {{ logstash_flush_size }}
            idle_flush_time => 5
            workers => {{ logstash_threads }}
          }
          # Forward logs to Splunk on TCP port 5514, which must match the port specified in the Splunk Web UI.
          tcp {
            mode => "client"
            host => "<Enter Destination listener IP address>"
            port => 5514
          }
        }

    Logstash can forward log data to multiple destinations, so there is no need to remove or alter the Elasticsearch section in the preceding file. However, if you choose to stop forwarding your log data to Elasticsearch, you can do so by removing the related section in this file, and then continue with the following steps.

  4. Commit your changes to the local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Your commit message"
  5. Run the configuration processor to check the status of all configuration files:

    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Run the ready-deployment playbook:

    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Implement the changes to the Logstash configuration file:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-server-configure.yml

Configuring the receiving service will vary from product to product. Consult the documentation for your particular product for instructions on how to set it up to receive log files from Logstash.

2.5 Integrating Your Logs with Splunk

2.5.1 Integrating with Splunk

The SUSE OpenStack Cloud 9 logging solution provides a flexible and extensible framework to centralize the collection and processing of logs from all nodes in your cloud. The logs are shipped to a highly available and fault-tolerant cluster where they are transformed and stored for better searching and reporting. The SUSE OpenStack Cloud 9 logging solution uses the ELK stack (Elasticsearch, Logstash and Kibana) as a production-grade implementation and can support other storage and indexing technologies.

You can configure Logstash, the service that aggregates and forwards the logs to a searchable index, to send the logs to a third-party target, such as Splunk.

For how to integrate the SUSE OpenStack Cloud 9 centralized logging solution with Splunk, including the steps to set up and forward logs, please refer to Section 4.1, “Splunk Integration”.

2.6 Integrating SUSE OpenStack Cloud with an LDAP System

You can configure your SUSE OpenStack Cloud cloud to work with an outside user authentication source such as Active Directory or OpenLDAP. keystone, the SUSE OpenStack Cloud identity service, functions as the first stop for any user authorization/authentication requests. keystone can also function as a proxy for user account authentication, passing along authentication and authorization requests to any LDAP-enabled system that has been configured as an outside source. This type of integration lets you use an existing user-management system such as Active Directory and its powerful group-based organization features as a source for permissions in SUSE OpenStack Cloud.

Upon successful completion of this tutorial, your cloud will refer user authentication requests to an outside LDAP-enabled directory system, such as Microsoft Active Directory or OpenLDAP.

2.6.1 Configure your LDAP source

To configure your SUSE OpenStack Cloud cloud to use an outside user-management source, perform the following steps:

  1. Make sure that the LDAP-enabled system you plan to integrate with is up and running and accessible over the necessary ports from your cloud management network.

  2. Edit the /var/lib/ardana/openstack/my_cloud/config/keystone/keystone.conf.j2 file and set the following options:

    domain_specific_drivers_enabled = True
    domain_configurations_from_database = False
  3. Create a YAML file in the /var/lib/ardana/openstack/my_cloud/config/keystone/ directory that defines your LDAP connection. You can make a copy of the sample keystone-LDAP configuration file, and then edit that file with the details of your LDAP connection.

    The following example copies the keystone_configure_ldap_sample.yml file and names the new file keystone_configure_ldap_my.yml:

    ardana > cp /var/lib/ardana/openstack/my_cloud/config/keystone/keystone_configure_ldap_sample.yml \
      /var/lib/ardana/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml
  4. Edit the new file to define the connection to your LDAP source. This guide does not provide comprehensive information on all aspects of the keystone_configure_ldap.yml file. Find a complete list of keystone/LDAP configuration file options at: https://github.com/openstack/keystone/tree/stable/rocky/etc

    The following file illustrates an example keystone configuration that is customized for an Active Directory connection.

    keystone_domainldap_conf:
    
        # CA certificates file content.
        # Certificates are stored in Base64 PEM format. This may be entire LDAP server
        # certificate (in case of self-signed certificates), certificate of authority
        # which issued LDAP server certificate, or a full certificate chain (Root CA
        # certificate, intermediate CA certificate(s), issuer certificate).
        #
        cert_settings:
          cacert: |
            -----BEGIN CERTIFICATE-----
    
            certificate appears here
    
            -----END CERTIFICATE-----
    
        # A domain will be created in MariaDB with this name, and associated with ldap back end.
        # Installer will also generate a config file named /etc/keystone/domains/keystone.<domain_name>.conf
        #
        domain_settings:
          name: ad
          description: Dedicated domain for ad users
    
        conf_settings:
          identity:
             driver: ldap
    
    
          # For a full list and description of ldap configuration options, please refer to
          # http://docs.openstack.org/liberty/config-reference/content/keystone-configuration-file.html.
          #
          # Please note:
          #  1. LDAP configuration is read-only. Configuration which performs write operations (i.e. creates users, groups, etc)
          #     is not supported at the moment.
          #  2. LDAP is only supported for identity operations (reading users and groups from LDAP). Assignment
          #     operations with LDAP (i.e. managing roles, projects) are not supported.
          #  3. LDAP is configured as non-default domain. Configuring LDAP as a default domain is not supported.
          #
    
          ldap:
            url: ldap://YOUR_COMPANY_AD_URL
            suffix: YOUR_COMPANY_DC
            query_scope: sub
            user_tree_dn: CN=Users,YOUR_COMPANY_DC
            user: CN=admin,CN=Users,YOUR_COMPANY_DC
            password: REDACTED
            user_objectclass: user
            user_id_attribute: cn
            user_name_attribute: cn
            group_tree_dn: CN=Users,YOUR_COMPANY_DC
            group_objectclass: group
            group_id_attribute: cn
            group_name_attribute: cn
            use_pool: True
            user_enabled_attribute: userAccountControl
            user_enabled_mask: 2
            user_enabled_default: 512
            use_tls: True
            tls_req_cert: demand
            # if you are configuring multiple LDAP domains, and LDAP server certificates are issued
            # by different authorities, make sure that you place certs for all the LDAP backend domains in the
            # cacert parameter as seen in this sample yml file so that all the certs are combined in a single CA file
            # and every LDAP domain configuration points to the combined CA file.
            # Note:
            # 1. Please be advised that every time a new ldap domain is configured, the single CA file gets overwritten
            # and hence ensure that you place certs for all the LDAP backend domains in the cacert parameter.
            # 2. There is a known issue on one cert per CA file per domain when the system processes
            # concurrent requests to multiple LDAP domains. Using the single CA file with all certs combined
            # shall get the system working properly.
    
            tls_cacertfile: /etc/keystone/ssl/certs/all_ldapdomains_ca.pem
  5. Add your new file to the local Git repository and commit the changes.

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add -A
    ardana > git commit -m "Adding LDAP server integration config"
  6. Run the configuration processor and deployment preparation playbooks to validate the YAML files and prepare the environment for configuration.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the keystone reconfiguration playbook to implement your changes, passing the newly created YAML file as an argument to the -e@FILE_PATH parameter:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml \
      -e@/var/lib/ardana/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml

    To integrate your SUSE OpenStack Cloud cloud with multiple domains, repeat these steps starting from Step 3 for each domain.
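
The user_enabled_attribute, user_enabled_mask, and user_enabled_default settings in the sample file work together: Active Directory stores the account state as a bit field in userAccountControl, where bit value 2 marks a disabled account and 512 is a normal enabled account. The following Python sketch illustrates how a mask of 2 separates enabled from disabled accounts. It reflects a reading of how the mask is applied, not keystone's own code, and the function name is an assumption made for this example.

```python
def is_enabled(user_account_control, mask=2):
    """Return True when the masked bit (account disabled) is not set."""
    # LDAP attribute values arrive as strings, so convert before masking.
    return (int(user_account_control) & mask) != mask


print(is_enabled(512))   # normal account
print(is_enabled(514))   # normal account with the disabled bit set
```

With the mask set to 2, the value 512 is treated as enabled and 514 as disabled, which is why the sample uses 512 as user_enabled_default.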

3 Cloud Lifecycle Manager Admin UI User Guide

The Cloud Lifecycle Manager Admin UI is a web-based GUI for viewing and managing the configuration of an installed cloud. After successfully deploying the cloud with the Install UI, the final screen displays a link to the CLM Admin UI. (For example, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.5 “Running the Install UI”, Cloud Deployment Successful). Usually the URL associated with this link is https://DEPLOYER_MGMT_NET_IP:9085, although it may be different depending on the cloud configuration and the installed version of SUSE OpenStack Cloud.

3.1 Accessing the Admin UI

In a browser, go to https://DEPLOYER_MGMT_NET_IP:9085.

The DEPLOYER_MGMT_NET_IP:PORT_NUMBER is not necessarily the same for all installations, and can be displayed with the following command:

ardana > openstack endpoint list --service ardana --interface admin -c URL

Accessing the Cloud Lifecycle Manager Admin UI and logging in requires access to the MANAGEMENT network that was configured when the cloud was deployed. Depending on the network setup, it may be necessary to use an SSH tunnel similar to what is recommended in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.5 “Running the Install UI”. The Admin UI requires keystone and HAProxy to be running and accessible. If keystone or HAProxy is not running, cloud reconfiguration is limited to the command line.

Logging in requires a keystone user. If the user is not an admin on the default domain and one or more projects, the Cloud Lifecycle Manager Admin UI will not display information about the Cloud and may present errors.

Figure 3.1: Cloud Lifecycle Manager Admin UI Login Page

3.2 Admin UI Pages

3.2.1 Services

Services pages relay information about the various OpenStack and other services that have been deployed as part of the cloud. Service information displays the list of services registered with keystone and the endpoints associated with those services. The information is equivalent to running the command openstack endpoint list.

The Service Information table contains the following information, based on how the service is registered with keystone:

Name

The name of the service. This may be an OpenStack code name.

Description

Service description. For some services this is a repeat of the name.

Endpoints

Services typically have one or more endpoints that are accessible to make API calls. The most common configuration is for a service to have Admin, Public, and Internal endpoints, with each intended for access by consumers corresponding to the type of endpoint.

Region

Service endpoints are part of a region. In multi-region clouds, some services will have endpoints in multiple regions.

Figure 3.2: Cloud Lifecycle Manager Admin UI Service Information
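The same service information can be retrieved on the command line. The sketch below assumes that the openstack client is installed and that admin credentials are already loaded in the shell. The function name and the RUN variable are illustrative conventions, not part of the product. Setting RUN=echo prints the commands instead of running them.

```shell
# List keystone services and their endpoints from the command line.
# Set RUN=echo to print the commands instead of running them.
show_service_information() (
  # --long adds the Description column to the service list
  ${RUN:-} openstack service list --long
  # -c limits the output to the named columns
  ${RUN:-} openstack endpoint list -c "Service Name" -c Interface -c Region -c URL
)
```

Call show_service_information in a shell with credentials loaded to see the name, description, endpoint, and region data that the Service Information table is built from.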

3.2.2 Packages

The Packages tab displays packages that are part of the SUSE OpenStack Cloud product.

The SUSE Cloud Packages table contains the following:

Name

The name of the SUSE Cloud package

Version

The version of the package which is installed in the Cloud

Figure 3.3: Cloud Lifecycle Manager Admin UI SUSE Cloud Package

Note

Packages with the venv- prefix denote the version of the specific OpenStack package that is deployed. The release name can be determined from the OpenStack Releases page.

3.2.3 Configuration

The Configuration tab displays services that are deployed in the cloud and the configuration files associated with those services. Services may be reconfigured by editing the .j2 files listed and clicking the Update button.

This page also provides the ability to set up SUSE Enterprise Storage Integration after initial deployment.

Figure 3.4: Cloud Lifecycle Manager Admin UI SUSE Service Configuration

Clicking one of the listed configuration files opens the file editor where changes can be made. Asterisks identify files that have been edited but have not had their updates applied to the cloud.

Figure 3.5: Cloud Lifecycle Manager Admin UI SUSE Service Configuration Editor

After editing the service configuration, click the Update button to begin deploying configuration changes to the cloud. The status of those changes will be streamed to the UI.

Configure SUSE Enterprise Storage After Initial Deployment

A link to the settings.yml file is available under the ses selection on the Configuration tab.

To set up SUSE Enterprise Storage Integration:

  1. Click on the link to edit the settings.yml file.

  2. Uncomment the ses_config_path parameter, specify the location on the deployer host containing the ses_config.yml file, and save the settings.yml file.

  3. If the ses_config.yml file does not yet exist in that location on the deployer host, a new link will appear for uploading a file from your local workstation.

  4. When ses_config.yml is present on the deployer host, it will appear in the ses section of the Configuration tab and can be edited directly there.
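For operators who prefer to prepare the file outside the UI, the edit in step 2 can be sketched as a small shell function. It assumes that settings.yml contains a commented-out ses_config_path line. That layout and the function name are assumptions, so check your copy of the file before using anything like this.

```shell
# Uncomment ses_config_path in a settings.yml file and point it at DIR.
# Usage: set_ses_config_path SETTINGS_FILE DIR
# Assumes the file has a line such as "# ses_config_path: ..." (an assumption).
set_ses_config_path() (
  file=$1
  dir=$2
  tmp=$(mktemp) || exit 1
  # Rewrite a commented-out or existing ses_config_path line;
  # all other lines pass through unchanged
  sed "s|^#\{0,1\}[[:space:]]*ses_config_path:.*|ses_config_path: $dir|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

The directory passed as DIR is the location on the deployer host that contains, or will contain, ses_config.yml.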

Note

If the cloud is configured using self-signed certificates, the streaming status updates (including the log) may be interrupted and require a reload of the CLM Admin UI. See Book “Security Guide”, Chapter 8 “Transport Layer Security (TLS) Overview”, Section 8.2 “TLS Configuration” for details on using signed certificates.

Figure 3.6: Cloud Lifecycle Manager Admin UI SUSE Service Configuration Update

3.2.4 Model

The Model tab displays input models that are deployed in the cloud and the associated model files. The model files listed can be modified.

Figure 3.7: Cloud Lifecycle Manager Admin UI SUSE Service Model

Clicking one of the listed model files opens the file editor where changes can be made. Asterisks identify files that have been edited but have not had their updates applied to the cloud.

Figure 3.8: Cloud Lifecycle Manager Admin UI SUSE Service Model Editor

After editing the model file, click the Validate button to validate changes. If validation is successful, Update is enabled. Click the Update button to deploy the changes to the cloud. Before starting deployment, a confirmation dialog shows the choices of only running config-processor-run.yml and ready-deployment.yml playbooks or running a full deployment. It also indicates the risk of updating the deployed cloud.

Figure 3.9: Cloud Lifecycle Manager Admin UI SUSE Service Model Confirmation
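For orientation, the first choice in the confirmation dialog corresponds to running the two named playbooks on the Cloud Lifecycle Manager. The sketch below assumes the usual CLM layout, with the playbooks under ~/openstack/ardana/ansible and the hosts/localhost inventory. The function name and the RUN variable are illustrative. Setting RUN=echo prints the commands instead of running them.

```shell
# Run the configuration processor and ready-deployment playbooks.
# The directory and inventory are assumptions based on the usual CLM layout.
# Set RUN=echo to print the commands instead of running them.
prepare_model_changes() (
  ${RUN:-} cd "$HOME/openstack/ardana/ansible" &&
    ${RUN:-} ansible-playbook -i hosts/localhost config-processor-run.yml &&
    ${RUN:-} ansible-playbook -i hosts/localhost ready-deployment.yml
)
```

Running RUN=echo prepare_model_changes first is a low-risk way to review the command sequence before it touches the deployed cloud.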

Click Update to start deployment. The status of the changes will be streamed to the UI.

Note

If the cloud is configured using self-signed certificates, the streaming status updates (including the log) may be interrupted. The CLM Admin UI must be reloaded. See Book “Security Guide”, Chapter 8 “Transport Layer Security (TLS) Overview”, Section 8.2 “TLS Configuration” for details on using signed certificates.

Figure 3.10: Cloud Lifecycle Manager Admin UI SUSE Service Model Update

3.2.5 Roles

The Services Per Role tab displays the list of all roles that have been defined in the Cloud Lifecycle Manager input model, the list of servers assigned to each role, and the services installed on those servers.

The Services Per Role table contains the following:

Role

The name of the role in the data model. In the included data model templates, these names are descriptive, such as MTRMON-ROLE for a metering and monitoring server. There is no strict constraint on role names and they may have been altered at install time.

Servers

The model IDs for the servers that have been assigned this role. This does not necessarily correspond to any DNS or other naming labels a host has, unless the host ID was set that way during install.

Services

A list of OpenStack and other Cloud related services that comprise this role. Servers that have been assigned this role will have these services installed and enabled.

Figure 3.11: Cloud Lifecycle Manager Admin UI Services Per Role

3.2.6 Servers

The Servers pages contain information about the hardware that comprises the cloud, including the configuration of the servers, and the ability to add new compute nodes to the cloud.

The Servers table contains the following information:

ID

This is the ID of the server in the data model. This does not necessarily correspond to any DNS or other naming labels a host has, unless the host ID was set that way during install.

IP Address

The management network IP address of the server

Server Group

The server group which this server is assigned to

NIC Mapping

The NIC mapping that describes the PCI slot addresses for the server's ethernet adapters

Mac Address

The hardware address of the server's primary physical ethernet adapter

Figure 3.12: Cloud Lifecycle Manager Admin UI Server Summary

3.2.7 Admin UI Server Details

Server Details can be viewed by clicking the menu at the right side of each row in the Servers table. The server details dialog contains the information from the Servers table and the following additional fields:

IPMI IP Address

The IPMI network address. This may be empty if the server was provisioned prior to being added to the Cloud.

IPMI Username

The username that was specified for IPMI access

IPMI Password

This is obscured in the read-only dialog, but is editable when adding a new server

Network Interfaces

The network interfaces configured on the server

Filesystem Utilization

Filesystem usage (percentage of filesystem in use). Only available if monasca is in use

Figure 3.13: Server Details (1/2)
Figure 3.14: Server Details (2/2)

3.3 Topology

The topology section of the Cloud Lifecycle Manager Admin UI displays an overview of how the Cloud is configured. Each section of the topology represents some facet of the Cloud configuration and provides a visual layout of the way components are associated with each other. Many of the components in the topology are linked to each other, and can be navigated between by clicking on any component that appears as a hyperlink.

3.3.1 Control Planes

The Control Planes tab displays control planes and availability zones within the Cloud.

Each control plane is shown as a table of clusters, resources, and load balancers (represented by vertical columns in the table).

Control Plane

A set of servers dedicated to running the infrastructure of the Cloud. Many Cloud configurations will have only a single control plane.

Clusters

A set of one or more servers hosting a particular set of services, tied to the role that has been assigned to that server. Clusters are generally differentiated from Resources in that they are fixed size groups of servers that do not grow as the Cloud grows.

Resources

Servers hosting the scalable parts of the Cloud, such as Compute Hosts that host VMs, or swift servers for object storage. These will vary in number with the size and scale of the Cloud and can generally be increased after the initial Cloud deployment.

Load Balancers

Servers that distribute API calls across servers hosting the called services.

Figure 3.15: Control Plane Topology

Availability Zones

Listed beneath the running services. Each row groups together the hosts in a particular availability zone for a particular cluster or resource type (the rows are availability zones, the columns are clusters or resources).

Figure 3.16: Control Plane Topology - Availability Zones

3.3.2 Regions

Displays the distribution of control plane services across regions. Clouds that have only a single region will list all services in the same cell.

Control Planes

The group of services that run the Cloud infrastructure

Region

Each region will be represented by a column with the region name as the column header. The list of services that are running in that region will be in that column, with each row corresponding to a particular control plane.

Figure 3.17: Regions Topology

3.3.3 Services

A list of services running in the Cloud, organized by the type (class) of service. Each service is then listed along with the control planes that the service is part of, the other services that each particular service consumes (requires), and the endpoints of the service, if the service exposes an API.

Class

A category of like services, such as "security" or "operations". Multiple services may belong to the same category.

Description

A short description of the service, typically sourced from the service itself

Service

The name of the service. For OpenStack services, this is the project codename, such as nova for virtual machine provisioning. Clicking a service will navigate to the section of this page with details for that particular service.

Figure 3.18: Services Topology

The detail data about a service provides additional insight into the service, such as which other services are required to run it and which network protocols can be used to access it.

Components

Each service is made up of one or more components, which are listed separately here. The components of a service may represent pieces of the service that run on different hosts, provide distinct functionality, or modularize business logic.

Control Planes

A service may be running in multiple control planes. Each control plane that a service is running in will be listed here.

Consumes

Other services required for this service to operate correctly.

Endpoints

How a service can be accessed, typically a REST API, though other network protocols may be listed here. Services that do not expose an API or have any sort of external access will not list any entries here.

Figure 3.19: Service Details Topology

3.3.4 Networks

Lists the networks and network groups that comprise the Cloud. Each network group is represented by a row in the table, with columns identifying which networks are used by the intersection of the group (row) and cluster/resource (column).

Group

The network group

Clusters

A set of one or more servers hosting a particular set of services, tied to the role that has been assigned to that server. Clusters are generally differentiated from Resources in that they are fixed size groups of servers that do not grow as the Cloud grows.

Resources

Servers hosting the scalable parts of the Cloud, such as Compute Hosts that host VMs, or swift servers for object storage. These will vary in number with the size and scale of the Cloud and can generally be increased after the initial Cloud deployment.

Cells in the middle of the table represent the network that is running on the resource/cluster represented by that column and is part of the network group identified in the leftmost column of the same row.

Figure 3.20: Networks Topology

Each network group is listed along with the servers and interfaces that comprise the network group.

Network Group

The elements that make up the network group, whose name is listed above the table

Networks

Networks that are part of the specified network group

Address

IP address of the corresponding server

Server

Server name of the server that is part of this network. Clicking on a server will load the server topology details.

Interface Model

The particular combination of hardware address and bonding that tie this server to the specified network group. Clicking on an Interface Model will load the corresponding section of the Roles page.

Figure 3.21: Network Groups Topology

3.3.5 Servers

A hierarchical display of the tree of Server Groups. Groups will be represented by a heading with their name, starting with the first row which contains the Cloud-wide server group (often called CLOUD). Within each Server Group, the Network Groups, Networks, Servers, and Server Roles are broken down. Note that server groups can be nested, producing a tree-like structure of groups.

Network Groups

The network groups that are part of this server group.

Networks

The network that is part of the server group and corresponds to the network group in the same row.

Server Roles

The model-defined role that was applied to the server, made up of a combination of services and the network and storage configurations unique to that role within the Cloud

Servers

The servers that have the role defined in their row and are part of the network group represented by the column the server is in.

Figure 3.22: Server Groups Topology

3.3.6 Roles

The list of server roles that define the server configurations for the Cloud. Each server role consists of several configurations. In this topology the focus is on the Disk Models and Network Interface Models that are applied to the servers with that role.

Server Role

The name of the role, as it is defined in the model

Disk Model

The name of the disk model

Volume Group

Name of the volume group

Mount

Name of the volume being mounted on the server

Size

The size of the volume as a percentage of physical disk space

FS Type

Filesystem type

Options

Optional flags applied when mounting the volume

PVol(s)

The physical address to the storage used for this volume group

Interface Model

The name of the interface model

Network Group

The name of the network group. Clicking on a Network Group will load the details of that group on the Networks page.

Interface/Options

Includes logical network names, such as hed1 and hed2, and bond information that groups logical network names together. The Cloud software will map these to physical devices.

Figure 3.23: Roles Topology

3.4 Server Management

3.4.1 Adding Servers

The Add Server page in the Cloud Lifecycle Manager Admin UI allows for adding additional Compute Nodes to the Cloud.

Figure 3.24: Add Server Overview

3.4.1.1 Available Servers

Servers that can be added to the Cloud are shown on the left side of the Add Server screen. Additional servers can be included in this list in three ways:

  1. Discover servers via SUSE Manager or HPE OneView (for details on adding servers via autodiscovery, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.4 “Optional: Importing Certificates for SUSE Manager and HPE OneView” and Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.5 “Running the Install UI”)

  2. Manually add servers individually by clicking Manual Entry and filling out the form with the server information (instructions below)

  3. Create a CSV file of the servers to be added (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.3 “Optional: Creating a CSV File to Import Server Data”)

Manually adding a server requires the following fields:

ID

A unique name for the server

IP Address

The IP address that the server has, or will have, in the Cloud

Server Group

Which server group the server will belong to. The IP address must be compatible with the selected Server Group. If the required Server Group is not present, it can be created

NIC Mapping

The NIC to PCI address mapping for the server being added to the Cloud. If the required NIC mapping is not present, it can be created

Role

Which compute role to add the server to. If this is set, the server will be immediately assigned that role on the right side of the page. If it is not set, the server will be added to the left side panel of available servers
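Before entering a server manually, it can help to check the planned IP address against the network used by the target Server Group. The sketch below takes one plausible reading of "compatible", namely that the address must fall inside that network's CIDR range. This reading, the function names, and the example addresses are all assumptions.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1   # split the address into its four octets
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if IP lies inside CIDR. Usage: ip_in_cidr IP NETWORK/PREFIX
ip_in_cidr() (
  ip=$1
  net=${2%/*}    # part before the slash
  bits=${2#*/}   # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
)

# Example with made-up values
if ip_in_cidr 192.168.10.25 192.168.10.0/24; then
  echo "address fits the network"
fi
```

The check is purely arithmetic and does not contact the cloud. The validation performed by the Configuration Processor remains the authoritative test.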

Some additional fields must be set if the server is not already provisioned with an OS, or if a new OS install is desired for the server. These fields are not required if an OpenStack Cloud compatible OS is already installed:

MAC Address

The MAC address of the IPMI network card of the server

IPMI IP Address

The IPMI network address (IP address) of the server

IPMI Username

Username to log in to IPMI on the server

IPMI Password

Password to log in to IPMI on the server

Figure 3.25: Manually Add Server

Servers in the available list can be dragged to the desired role on the right. Only Compute-related roles will be displayed.

Figure 3.26: Manually Add Server

3.4.1.2 Add Server Settings

There are several settings that apply across all Compute Nodes being added to the Cloud. Beneath the list of nodes, users will find options to control whether existing nodes can be modified, whether the new nodes should have their data disks wiped, and whether to activate the new Compute Nodes as part of the update process.

Safe Mode

Prevents modification of existing Compute Nodes. Can be unchecked to allow modifications. Modifying existing Compute Nodes has the potential to disrupt the continuous operation of the Cloud and should be done with caution.

Wipe Data Disks

The data disks on the new server will not be wiped by default, but users can choose to wipe the data disks as part of the process of adding the Compute Node(s) to the Cloud.

Activate

Activates the added Compute Node(s) during the process of adding them to the Cloud. Activation adds a Compute Node to the pool of nodes that the nova-scheduler uses when instantiating VMs.

Figure 3.27: Add Server Settings options

3.4.1.3 Install OS

Servers that have been assigned a role but not yet deployed can have SLES installed as part of the Cloud deployment. This step is necessary for servers that are not provisioned with an OS.

On the Install OS page, the Available Servers list will be populated with servers that have been assigned to a role but not yet deployed to the Cloud. From here, select which servers to install an OS onto and use the arrow controls to move them to the Selected Servers box on the right. After all servers that require an OS to be provisioned have been added to the Selected Servers list, click Next.

Figure 3.28: Select Servers to Provision OS

The UI will prompt for confirmation that the OS should be installed, because provisioning an OS will replace any existing operating system on the server.

Figure 3.29: Confirm Provision OS

When the OS install begins, progress of the install will be displayed on screen.

Figure 3.30: OS Install Progress

After OS provisioning is complete, a summary of the provisioned servers will be displayed. Clicking Close will return the user to the role selection page where deployment can continue.

Figure 3.31: OS Install Summary

3.4.1.4 Deploy New Servers

When all newly added servers have an OS provisioned, either via the Install OS process detailed above or having previously been provisioned outside of the Cloud Lifecycle Manager Admin UI, deployment can begin.

The Deploy button will be enabled when one or more new servers have been assigned roles. Clicking Deploy prompts for confirmation before beginning the deployment process.

Figure 3.32: Confirm Deploy Servers

The deployment process will begin by running the Configuration Processor in basic validation mode to check the values input for the servers being added. This will check IP addresses, server groups, and NIC mappings for syntax or format errors.

Figure 3.33: Validate Server Changes

After validation is successful, the servers will be prepared for deployment. The preparation consists of running the full Configuration Processor and two additional playbooks to ready servers for deployment.

Figure 3.34: Prepare Servers

After the servers have been prepared, deployment can begin. This process will generate a new hosts file, run the site.yml playbook, and update monasca (if monasca is deployed).

Figure 3.35: Deploy Servers
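As a rough command-line counterpart to the site.yml step, the sketch below runs the playbook from the scratch directory used elsewhere in this guide and restricts it to one host with the standard ansible-playbook --limit option. The function name, the RUN variable, and the use of --limit are assumptions for illustration. The hosts file generation and the monasca update are not shown, and the exact invocation used by the Admin UI may differ.

```shell
# Run site.yml for a single newly added host.
# HOST must match the server's name in the Ansible inventory.
# Set RUN=echo to print the commands instead of running them.
deploy_new_server() (
  host=$1
  ${RUN:-} cd "$HOME/scratch/ansible/next/ardana/ansible" &&
    ${RUN:-} ansible-playbook -i hosts/verb_hosts site.yml --limit "$host"
)
```

Limiting the run to the new host keeps the playbook from reapplying configuration to servers that are already deployed.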

When deployment is completed, a summary page will be displayed. Clicking Close will return to the Add Server page.

Figure 3.36: Deploy Summary

3.4.2 Activating Servers

The Server Summary page in the Cloud Lifecycle Manager Admin UI allows for activating Compute Nodes in the Cloud. Compute Nodes may be activated when they are added to the Cloud. An activated compute node is available for the nova-scheduler to use for hosting new VMs that are created. Only servers that are not currently activated will have the activation menu option available.

Figure 3.37: Activate Server

Once activation is triggered, the progress of activating the node and adding it to the nova-scheduler is displayed.

Figure 3.38: Activate Server Progress
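One way to confirm the result of an activation from the command line is to look at the nova-compute service entry for the host. The sketch assumes the openstack client with admin credentials loaded. The function name and the RUN variable are illustrative, and the host name is whatever nova reports for the node.

```shell
# Show the nova-compute service record for one Compute Node.
# Set RUN=echo to print the command instead of running it.
compute_service_status() (
  host=$1
  ${RUN:-} openstack compute service list --service nova-compute --host "$host"
)
```

In the output, a Status of enabled together with a State of up generally indicates that the scheduler can place new VMs on the node.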

3.4.3 Deactivating Servers

The Server Summary page in the Cloud Lifecycle Manager Admin UI allows for deactivating Compute Nodes in the Cloud. Deactivating a Compute Node removes it from the pool of servers that the nova-scheduler will put VMs on. When a Compute Node is deactivated, the UI attempts to migrate any currently running VMs from that server to an active node.

Figure 3.39: Deactivate Server

The deactivation process requires confirmation before proceeding.

Figure 3.40: Deactivate Server Confirmation

Once deactivation is triggered, the progress of deactivating the node and removing it from the nova-scheduler is displayed.

Figure 3.41: Deactivate Server Progress

If a Compute Node selected for deactivation has VMs running on it, a prompt will appear to select where to migrate the running VMs.

Figure 3.42: Select Migration Target

A summary of the VMs being migrated will be displayed, along with the progress migrating them from the deactivated Compute Node to the target host. Once the migration attempt is complete, click 'Done' to continue the deactivation process.

Figure 3.43: Deactivate Migration Progress
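Because the migration is described as an attempt, it can be worth checking afterwards whether any instances remain on the deactivated node. The sketch below lists instances by host with the openstack client. It assumes admin credentials are loaded, and the function name and RUN variable are illustrative.

```shell
# List instances in all projects that are still placed on HOST.
# Set RUN=echo to print the command instead of running it.
instances_on_host() (
  host=$1
  ${RUN:-} openstack server list --all-projects --host "$host" -c ID -c Name -c Status
)
```

An empty list suggests that all instances were moved off the node. Any remaining entries would need attention before the node is deleted.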

3.4.4 Deleting Servers

The Server Summary page in the Cloud Lifecycle Manager Admin UI allows for deleting Compute Nodes from the Cloud. Deleting a Compute Node removes it from the cloud. Only Compute Nodes that are deactivated can be deleted.

Figure 3.44: Delete Server

The deletion process requires confirmation before proceeding.

Figure 3.45: Delete Server Confirmation

If the Compute Node is not reachable (SSH from the deployer is not possible), a warning will appear, requesting confirmation that the node is shut down or otherwise removed from the environment. Reachable Compute Nodes will be shut down as part of the deletion process.

Figure 3.46: Unreachable Delete Confirmation
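To see in advance whether the warning is likely to appear, SSH reachability can be tested from the deployer. The sketch uses standard OpenSSH options. The function name, the RUN variable, and the five-second timeout are illustrative choices, and the check performed by the Admin UI may differ.

```shell
# Test whether HOST accepts a non-interactive SSH login from this machine.
# Set RUN=echo to print the command instead of running it.
check_reachable() (
  host=$1
  # BatchMode disables password prompts; ConnectTimeout bounds the wait in seconds
  ${RUN:-} ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
)
```

The function returns the exit status of ssh, so it can be used in a condition such as check_reachable HOST || echo "HOST is not reachable".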

The progress of deleting the Compute Node will be displayed, including a streaming log with additional details of the running playbooks.

Figure 3.47: Delete Server Progress

3.5 Server Replacement

The process of replacing a server is initiated from the Server Summary (see Section 3.2.6, “Servers”). Replacing a server will remove the existing server from the Cloud configuration and install the new server in its place. The rest of this process varies slightly depending on the type of server being replaced.

3.5.1 Control Plane Servers

Servers that are part of the Control Plane (generally those that are not hosting Compute VMs or ephemeral storage) are replaced "in-place". This means the replacement server has the same IP Address and is expected to have the same NIC Mapping and Server Group as the server being replaced.

To replace a Control Plane server, click the menu to the right of the server listing on the Summary tab of the Section 3.2.6, “Servers” page. From the menu options, select Replace.

Figure 3.48: Replace Server Menu

Selecting Replace will open a dialog box that includes information about the server being replaced, as well as a form for inputting the required information for the new server.

Figure 3.49: Replace Controller Form

The IPMI information for the new server is required to perform the replacement process.

MAC Address

The hardware address of the server's primary physical ethernet adapter

IPMI IP Address

The network address for IPMI access to the new server

IPMI Username

The username credential for IPMI access to the new server

IPMI Password

The password associated with the IPMI Username on the new server

To use a server that has already been discovered, check the box for Use available servers and select an existing server from the Available Servers dropdown. This will automatically populate the server information fields above with the information previously entered/discovered for the specified server.

If SLES is not already installed, or to reinstall SLES on the new server, check the box for Install OS. The username will be pre-populated with the username from the Cloud install. Installing the OS requires specifying the password that was used for deploying the cloud so that the replacement process can access the host after the OS is installed.

The data disks on the new server will not be wiped by default, but users can choose to wipe the data disks as part of the replacement process.

Once the new server information is set, click the Replace button in the lower right to begin replacement. A list of the replacement process steps will be displayed, and there will be a link at the bottom of the list to show the log file as the changes are made.

Figure 3.50: Replace Controller Progress

When all of the steps are complete, click Close to return to the Servers page.

3.5.2 Compute Servers

When servers that host VMs are replaced, the following actions happen:

  1. a new server is added

  2. existing instances are migrated from the existing server to the new server

  3. the existing server is deleted from the model

The new server will not have the same IP Address and may have a different NIC Mapping and Server Group than the server being replaced.

To replace a Compute server, click the menu to the right of the server listing on the Summary tab of the Section 3.2.6, “Servers” page. From the menu options, select Replace.

Figure 3.51: Replace Compute Menu

Selecting Replace will open a dialog box that includes information about the server being replaced, and a form for inputting the required information for the new server.

If the IP address of the server being replaced cannot be reached by the deployer, a warning will appear to verify that the replacement should continue.

Figure 3.52: Unreachable Compute Node Warning
Figure 3.53: Replace Compute Form

Replacing a Compute server involves adding the new server and then performing migration. This requires some new information:

  • an unused IP address

  • a new ID

  • selections for Server Group and NIC Mapping, which do not need to match the original server.

ID

This is the ID of the server in the data model. This does not necessarily correspond to any DNS or other naming labels of a host, unless the host ID was set that way during install.

IP Address

The management network IP address of the server

Server Group

The server group which this server is assigned to. If the required Server Group does not exist, it can be created

NIC Mapping

The NIC mapping that describes the PCI slot addresses for the server's ethernet adapters. If the required NIC mapping does not exist, it can be created

The IPMI information for the new server is also required to perform the replacement process.

Mac Address

The hardware address of the server's primary physical ethernet adapter

IPMI IP Address

The network address for IPMI access to the new server

IPMI Username

The username credential for IPMI access to the new server

IPMI Password

The password associated with the IPMI Username

To use a server that has already been discovered, check the box for Use available servers and select an existing server from the Available Servers dropdown. This will automatically populate the server information fields above with the information previously entered/discovered for the specified server.

If SLES is not already installed, or to reinstall SLES on the new server, check the box for Install OS. The username will be pre-populated with the username from the Cloud install. Installing the OS requires specifying the password that was used for deploying the cloud so that the replacement process can access the host after the OS is installed.

The data disks on the new server will not be wiped by default, but wiping the data disks can be specified as part of the replacement process.

When the new server information is set, click the Replace button in the lower right to begin replacement. The configuration processor will be run to validate that the entered information is compatible with the configuration of the Cloud.

When validation has completed, the Compute replacement takes place in several distinct steps, and each will have its own page with a list of process steps displayed. A link at the bottom of the list can show the log file as the changes are made.

  1. Install SLES if that option was selected

    Figure 3.54: Install SLES on New Compute
  2. Commit the changes to the data model and run the configuration processor

    Figure 3.55: Prepare Compute Server
  3. Deploy the new server, install services on it, update monasca (if installed), and activate the server with nova so that it can host VMs.

    Figure 3.56: Deploy New Compute Server
  4. Disable the existing server. If the existing server is unreachable, there may be warnings about disabling services on that server.

    Figure 3.57: Host Aggregate Removal Warning

    If the existing server is reachable, instances on that server will be migrated to the new server.

    Figure 3.58: Migrate Instances from Existing Compute Server

    If the existing server is not reachable, the migration step will be skipped.

    Figure 3.59: Disable Existing Compute Server
  5. Remove the existing server from the model and update the cloud configuration. If the server is not reachable, the user is asked to verify that the server is shut down. If the server is reachable, the cloud services running on it will be stopped and the server will be shut down as part of the removal from the Cloud.

    Figure 3.60: Existing Server Shutdown Check

    Upon verification that the unreachable host is shut down, it will be removed from the data model.

    Figure 3.61: Existing Server Delete

    After the model has been updated, a summary of the changes will appear. Click Close to return to the server summary screen.

    Figure 3.62: Compute Replacement Summary

4 Third-Party Integrations

4.1 Splunk Integration

This documentation demonstrates the possible integration between the SUSE OpenStack Cloud 9 centralized logging solution and Splunk including the steps to set up and forward logs.

The SUSE OpenStack Cloud 9 logging solution provides a flexible and extensible framework to centralize the collection and processing of logs from all of the nodes in a cloud. The logs are shipped to a highly available and fault tolerant cluster where they are transformed and stored for better searching and reporting. The SUSE OpenStack Cloud 9 logging solution uses the ELK stack (Elasticsearch, Logstash and Kibana) as a production grade implementation and can support other storage and indexing technologies. The Logstash pipeline can be configured to forward the logs to an alternative target if you wish.

4.1.1 What is Splunk?

Splunk is software for searching, monitoring, and analyzing machine-generated big data, via a web-style interface. Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations. It is commercial software (unlike Elasticsearch) and more details about Splunk can be found at https://www.splunk.com.

4.1.2 Configuring Splunk to receive log messages from SUSE OpenStack Cloud 9

This documentation assumes that you already have Splunk set up and running. For help with installing and setting up Splunk, refer to Splunk Tutorial.

There are different ways in which a log message (or "event" in Splunk's terminology) can be sent to Splunk. These steps will set up a TCP port where Splunk will listen for messages.

  1. On the Splunk web UI, click on the Settings menu in the upper right-hand corner.

  2. In the Data section of the Settings menu, click Data Inputs.

  3. Choose the TCP option.

  4. Click the New button to add an input.

  5. In the Port field, enter the port number you want to use.

    Note

    If you are on a less secure network and want to restrict connections to this port, use the Only accept connection from field to restrict the traffic to a specific IP address.

  6. Click the Next button.

  7. Specify the Source Type by clicking on the Select button and choosing linux_messages_syslog from the list.

  8. Click the Review button.

  9. Review the configuration and click the Submit button.

  10. A success message will be displayed.
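
Before reconfiguring Logstash, it can be useful to confirm that the new TCP input is reachable by sending it a single test line. The following is a minimal sketch using only the Python standard library. The function name send_test_event is hypothetical, and the host and port passed to it are assumed to be the Splunk listener address and the port entered in the steps above.

```python
import socket


def send_test_event(host, port, message, timeout=5.0):
    """Send one newline-terminated line to a TCP listener.

    Returns the number of bytes sent. Raises OSError if the
    connection cannot be established within the timeout.
    """
    payload = (message.rstrip("\n") + "\n").encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
    return len(payload)
```

If the call returns without an exception, the port accepted the connection. Whether the event was indexed still has to be checked in the Splunk UI.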

4.1.3 Forwarding log messages from SUSE OpenStack Cloud 9 Centralized Logging to Splunk

When you have Splunk set up and configured to receive log messages, you can configure SUSE OpenStack Cloud 9 to forward the logs to Splunk.

  1. Log in to the Cloud Lifecycle Manager.

  2. Check the status of the logging service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts logging-status.yml

    If everything is up and running, continue to the next step.

  3. Edit the logstash config file at the location below:

    ~/openstack/ardana/ansible/roles/logging-server/templates/logstash.conf.j2

    At the bottom of the file is a section for the Logstash outputs. Add the details of your Splunk environment there.

    Below is an example; the tcp block is the addition:

    # Logstash outputs
    #------------------------------------------------------------------------------
    output {
      # Configure Elasticsearch output
      # http://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
      elasticsearch {
        index => %{[@metadata][es_index]}
        hosts => ["{{ elasticsearch_http_host }}:{{ elasticsearch_http_port }}"]
        flush_size => {{ logstash_flush_size }}
        idle_flush_time => 5
        workers => {{ logstash_threads }}
      }
      # Forward Logs to Splunk on the TCP port that matches the one specified in Splunk Web UI.
      tcp {
        mode => "client"
        host => "<Enter Splunk listener IP address>"
        port => TCP_PORT_NUMBER
      }
    }
    Note

    If you plan to use only the Splunk UI to parse your centralized logs, there is no need to forward your logs to Elasticsearch. In this situation, comment out the lines in the Logstash outputs pertaining to Elasticsearch. Otherwise, you can continue to forward your centralized logs to multiple locations.

  4. Commit your changes to git:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Logstash configuration change for Splunk integration"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Complete this change with a reconfigure of the logging environment:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
  8. In your Splunk UI, confirm that the logs have begun to forward.

4.1.4 Searching for log messages from the Splunk dashboard

To verify that the integration works and to search the forwarded log messages, return to your Splunk dashboard. In the search field, use this string:

source="tcp:TCP_PORT_NUMBER"

Find information on using the Splunk search tool at http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/WelcometotheSearchTutorial.

4.2 Operations Bridge Integration

The SUSE OpenStack Cloud 9 monitoring solution (monasca) can easily be integrated with your existing monitoring tools. Integrating SUSE OpenStack Cloud 9 monasca with Operations Bridge using the Operations Bridge Connector simplifies monitoring and managing events and topology information.

The integration provides the following functionality:

  • Forwarding of SUSE OpenStack Cloud monasca alerts and topology to Operations Bridge for event correlation

  • Customization of forwarded events and topology

For more information about this connector please see https://software.microfocus.com/en-us/products/operations-bridge-suite/overview.

4.3 Monitoring Third-Party Components With Monasca

4.3.1 monasca Monitoring Integration Overview

monasca, the SUSE OpenStack Cloud 9 monitoring service, collects information about your cloud's systems, and allows you to create alarm definitions based on these measurements. monasca-agent is the component that collects metrics and forwards them to the monasca-api for further processing, such as metric storage and alarm thresholding.

With a small amount of configuration, you can use the detection and check plugins that are provided with your cloud to monitor integrated third-party components. In addition, you can write custom plugins and integrate them with the existing monitoring service.

Find instructions for customizing existing plugins to monitor third-party components in Section 4.3.4, “Configuring Check Plugins”.

Find instructions for installing and configuring new custom plugins in Section 4.3.3, “Writing Custom Plugins”.

You can also use existing alarm definitions, as well as create new alarm definitions that relate to a custom plugin or metric. Instructions for defining new alarm definitions are in Section 4.3.6, “Configuring Alarm Definitions”.

You can use the Operations Console and monasca CLI to list all of the alarms, alarm-definitions, and metrics that exist on your cloud.

4.3.2 monasca Agent

The monasca agent (monasca-agent) collects information about your cloud using the installed plugins. The plugins are written in Python, and determine the monitoring metrics for your system, as well as the interval for collection. The default collection interval is 30 seconds, and we strongly recommend not changing this default value.

The following two types of custom plugins can be added to your cloud.

  • Detection Plugin. Determines whether the monasca-agent has the ability to monitor the specified component or service on a host. If successful, this type of plugin configures an associated check plugin by creating a YAML configuration file.

  • Check Plugin. Specifies the metrics to be monitored, using the configuration file created by the detection plugin.

monasca-agent is installed on every server in your cloud, and provides plugins that monitor the following.

  • System metrics relating to CPU, memory, disks, host availability, etc.

  • Process health metrics (process, http_check)

  • SUSE OpenStack Cloud 9-specific component metrics, such as apache, rabbitmq, kafka, cassandra, etc.

monasca is pre-configured with default check plugins and associated detection plugins. The default plugins can be reconfigured to monitor third-party components, and often only require small adjustments to adapt them to this purpose. Find a list of the default plugins here: https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md#detection-plugins

Often, a single check plugin will be used to monitor multiple services. For example, many services use the http_check.py detection plugin to detect the up/down status of a service endpoint. Often the process.py check plugin, which provides process monitoring metrics, is used as a basis for a custom process detection plugin.

More information about the monasca agent can be found in the following locations

4.3.3 Writing Custom Plugins

When the pre-built monasca plugins do not meet your monitoring needs, you can write custom plugins to monitor your cloud. After you have written a plugin, you must install and configure it.

When your needs dictate a very specific custom monitoring check, you must provide both a detection and check plugin.

The steps involved in configuring a custom plugin include running a detection plugin and passing any necessary parameters to it, so that the resulting check configuration file is created with all necessary data.

When using an existing check plugin to monitor a third-party component, a custom detection plugin is needed only if there is not an associated default detection plugin.

Check plugin configuration files

Each plugin needs a corresponding YAML configuration file with the same stem name as the plugin check file. For example, the plugin file http_check.py (in /usr/lib/python2.7/site-packages/monasca_agent/collector/checks_d/) should have a corresponding configuration file, http_check.yaml (in /etc/monasca/agent/conf.d/http_check.yaml). The stem name http_check must be the same for both files.
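
The naming rule above can be expressed as a small path mapping. This is an illustrative sketch only: the helper name config_path_for is hypothetical, and the configuration directory is taken from the example in the text.

```python
import posixpath

# Configuration directory named in the text above.
CONF_DIR = "/etc/monasca/agent/conf.d"


def config_path_for(plugin_file, conf_dir=CONF_DIR):
    """Return the YAML configuration path that shares the plugin's stem name."""
    stem, ext = posixpath.splitext(posixpath.basename(plugin_file))
    if ext != ".py":
        raise ValueError("expected a Python plugin file, got %r" % plugin_file)
    return posixpath.join(conf_dir, stem + ".yaml")
```

For the plugin file http_check.py, the helper returns the path of http_check.yaml in the configuration directory, which is the pairing described above.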

The YAML configuration file must be owned by the agent user, with read+write permission for that user, read permission for the file's group, and no access for other users. The following example, in which the owner is monasca-agent and the group is monasca, shows correct permission settings for the file http_check.yaml.

ardana > ls -alt /etc/monasca/agent/conf.d/http_check.yaml
-rw-r----- 1 monasca-agent monasca 10590 Jul 26 05:44 http_check.yaml
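
The permission string in the listing above, -rw-r-----, corresponds to octal mode 0640. A check like the following could be used to verify it from a script. This is a sketch using only the Python standard library; the function names are hypothetical, and ownership is not checked here.

```python
import os
import stat


def permission_bits(path):
    """Return only the permission bits of a file (for example 0o640)."""
    return stat.S_IMODE(os.stat(path).st_mode)


def is_owner_rw_group_r(path):
    """True when the file is rw for owner, r for group, nothing for others."""
    return permission_bits(path) == 0o640
```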

A check plugin YAML configuration file has the following structure.

init_config:
    key1: value1
    key2: value2

instances:
    - name: john_smith
      username: john_smith
      password: 123456
    - name: jane_smith
      username: jane_smith
      password: 789012

In the above file structure, the init_config section allows you to specify any number of global key:value pairs. Each pair will be available on every run of the check that relates to the YAML configuration file.

The instances section allows you to list the instances that the related check will be run on. The check will be run once on each instance listed in the instances section. Ensure that each instance listed in the instances section has a unique name.
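
Because each instance must have a unique name, it can help to check a configuration for duplicates before deploying it. The sketch below assumes the YAML file has already been loaded into a Python dictionary with the structure shown above; the function name duplicate_instance_names is hypothetical.

```python
from collections import Counter


def duplicate_instance_names(config):
    """Return the sorted list of instance names that appear more than once."""
    instances = config.get("instances") or []
    names = [inst.get("name") for inst in instances
             if inst.get("name") is not None]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# The example structure from above, as a dictionary.
example = {
    "init_config": {"key1": "value1", "key2": "value2"},
    "instances": [
        {"name": "john_smith", "username": "john_smith", "password": 123456},
        {"name": "jane_smith", "username": "jane_smith", "password": 789012},
    ],
}
```

An empty result for the example dictionary means that no instance name is repeated.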

Custom detection plugins

Detection plugins should be written to perform checks that ensure that a component can be monitored on a host. Any arguments needed by the associated check plugin are passed into the detection plugin at setup (configuration) time. The detection plugin will write to the associated check configuration file.

When a detection plugin is successfully run in the configuration step, it will write to the check configuration YAML file. The configuration file for the check is written to the following directory.

/etc/monasca/agent/conf.d/

Writing process detection plugin using the ServicePlugin class

The monasca-agent provides a ServicePlugin class that makes process detection monitoring easy.

Process check

The process check plugin generates metrics based on the process status for specified process names. It generates process.pid_count metrics for the specified dimensions, and a set of detailed process metrics for the specified dimensions by default.

The ServicePlugin class allows you to specify a list of process name(s) to detect, and uses psutil to see if the process exists on the host. It then appends the process name(s) to the process.yaml configuration file, if they are not already present.

The following is an example of a process.py check ServicePlugin.

import logging

import monasca_setup.detection

# Module-level logger used by the plugin below.
log = logging.getLogger(__name__)


class monascaTransformDetect(monasca_setup.detection.ServicePlugin):
    """Detect monasca Transform daemons and setup configuration to monitor them."""
    def __init__(self, template_dir, overwrite=False, args=None):
        log.info("      Watching the monasca transform processes.")
        service_params = {
            'args': {},
            'template_dir': template_dir,
            'overwrite': overwrite,
            'service_name': 'monasca-transform',
            'process_names': ['monasca-transform','pyspark',
                              'transform/lib/driver']
        }
        super(monascaTransformDetect, self).__init__(service_params)

Writing a Custom Detection Plugin using Plugin or ArgsPlugin classes

A custom detection plugin class should derive from either the Plugin or ArgsPlugin classes provided in the /usr/lib/python2.7/site-packages/monasca_setup/detection directory.

If the plugin parses command line arguments, the ArgsPlugin class is useful. The ArgsPlugin class derives from the Plugin class. The ArgsPlugin class has a method to check for required arguments, and a method to return the instance that will be used for writing to the configuration file with the dimensions from the command line parsed and included.

If the ArgsPlugin methods do not seem to apply, then derive directly from the Plugin class.

When deriving from these classes, the following methods should be implemented.

  • _detect - sets self.available=True when the component to monitor exists on the host.

  • build_config - writes the instance information to the configuration and returns the configuration.

  • dependencies_installed (default implementation is in ArgsPlugin, but not Plugin) - returns True when the required Python libraries are installed.

The following is an example custom detection plugin.

import ast
import logging

import monasca_setup.agent_config
import monasca_setup.detection

log = logging.getLogger(__name__)


class HttpCheck(monasca_setup.detection.ArgsPlugin):
    """Setup an http_check according to the passed in args.
       Despite being a detection plugin this plugin does no detection and will be a noop without arguments.
       Expects space separated arguments, the required argument is url. Optional parameters include:
       disable_ssl_validation and match_pattern.
    """

    def _detect(self):
        """Run detection, set self.available True if the service is detected.
        """
        self.available = self._check_required_args(['url'])

    def build_config(self):
        """Build the config as a Plugins object and return.
        """
        config = monasca_setup.agent_config.Plugins()
        # No support for setting headers at this time
        instance = self._build_instance(['url', 'timeout', 'username', 'password',
                                         'match_pattern', 'disable_ssl_validation',
                                         'name', 'use_keystone', 'collect_response_time'])

        # Normalize any boolean parameters
        for param in ['use_keystone', 'collect_response_time']:
            if param in self.args:
                instance[param] = ast.literal_eval(self.args[param].capitalize())
        # Set some defaults
        if 'collect_response_time' not in instance:
            instance['collect_response_time'] = True
        if 'name' not in instance:
            instance['name'] = self.args['url']

        config['http_check'] = {'init_config': None, 'instances': [instance]}

        return config
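
The boolean normalization step in the example above relies on str.capitalize() followed by ast.literal_eval(). The following standalone sketch isolates that idea. The function name normalize_bool is hypothetical, and the explicit type check is an addition that the plugin above does not perform.

```python
import ast


def normalize_bool(text):
    """Convert 'true', 'False', 'TRUE', ... to a Python bool.

    capitalize() turns the text into 'True' or 'False', which
    literal_eval() then parses as a boolean literal.
    """
    value = ast.literal_eval(text.capitalize())
    if not isinstance(value, bool):
        raise ValueError("not a boolean literal: %r" % text)
    return value
```

Text that is not a Python literal after capitalization, such as yes, causes literal_eval() to raise ValueError.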

Installing a detection plugin in the OpenStack version delivered with SUSE OpenStack Cloud

Install a plugin by copying it to the plugin directory (/usr/lib/python2.7/site-packages/monasca_agent/collector/checks_d/).

The plugin should have file permissions of read+write for the root user (the user that should also own the file) and read for the root group and all other users.

The following is an example of correct file permissions for the http_check.py file.

-rw-r--r-- 1 root root 1769 Sep 19 20:14 http_check.py

Detection plugins should be placed in the following directory.

/usr/lib/monasca/agent/custom_detect.d/

The detection plugin directory name should be accessed using the monasca_agent_detection_plugin_dir Ansible variable. This variable is defined in the roles/monasca-agent/vars/main.yml file.

monasca_agent_detection_plugin_dir: /usr/lib/monasca/agent/custom_detect.d/

Example: Add Ansible monasca_configure task to install the plugin. (The monasca_configure task can be added to any service playbook.) In this example, it is added to ~/openstack/ardana/ansible/roles/_CEI-CMN/tasks/monasca_configure.yml.

---
- name: _CEI-CMN | monasca_configure |
    Copy ceilometer Custom plugin
  become: yes
  copy:
    src: ardanaceilometer_mon_plugin.py
    dest: "{{ monasca_agent_detection_plugin_dir }}"
    owner: root
    group: root
    mode: 0440

Custom check plugins

Custom check plugins generate metrics. Scalability should be taken into consideration on systems that will have hundreds of servers, as a large number of metrics can affect performance by impacting disk performance, RAM and CPU usage.

You may want to tune your configuration parameters so that less-important metrics are not monitored as frequently. When check plugins are configured (when they have an associated YAML configuration file) the agent will attempt to run them.

Checks should be able to run within the 30-second metric collection window. If your check runs a command, you should provide a timeout to prevent the check from running longer than the default 30-second window. You can use monasca_agent.common.util.timeout_command to set a timeout in your custom check plugin Python code.
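
To illustrate the idea of bounding a command's run time, the following sketch uses subprocess.run from the Python standard library as a stand-in. It is not the monasca-agent timeout_command helper, and the function name run_with_timeout and its return convention are assumptions made for this example.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run a command, giving up after timeout_seconds.

    Returns (stdout, stderr, return_code), or (None, None, None)
    if the command did not finish in time.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return None, None, None
    return completed.stdout, completed.stderr, completed.returncode
```

A check would typically choose a timeout well below the 30-second collection window so that the remaining work in the check still has time to run.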

Find a description of how to write custom check plugins at https://github.com/openstack/monasca-agent/blob/master/docs/Customizations.md#creating-a-custom-check-plugin

Custom checks derive from the AgentCheck class located in the monasca_agent/collector/checks/check.py file. A check method is required.

Metrics should contain dimensions that make each item that you are monitoring unique (such as service, component, hostname). The hostname dimension is defined by default within the AgentCheck class, so every metric has this dimension.

A custom check will do the following.

  • Read the configuration instance passed into the check method.

  • Set dimensions that will be included in the metric.

  • Create the metric with gauge, rate, or counter types.

Metric Types:

  • gauge: Instantaneous reading of a particular value (for example, mem.free_mb).

  • rate: Measurement over a time period. The following equation can be used to define rate.

    rate=delta_v/float(delta_t)
  • counter: The number of events, adjusted with increment and decrement methods (for example, zookeeper.timeouts).
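
The rate equation above can be worked through with two samples of a monotonically increasing value. This is a plain-Python sketch of the arithmetic only; the function name rate and the (timestamp, value) sample format are assumptions for illustration.

```python
def rate(prev_sample, curr_sample):
    """Compute delta_v / float(delta_t) from two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = prev_sample, curr_sample
    delta_t = t1 - t0
    if delta_t <= 0:
        raise ValueError("samples must be in increasing time order")
    delta_v = v1 - v0
    return delta_v / float(delta_t)
```

For example, a value that grows from 100 to 160 over one 30-second collection interval has a rate of 2.0 per second.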

The following is an example component check named SimpleCassandraExample.

import monasca_agent.collector.checks as checks
from monasca_agent.common.util import timeout_command

CASSANDRA_VERSION_QUERY = "SELECT version();"


class SimpleCassandraExample(checks.AgentCheck):

    def __init__(self, name, init_config, agent_config):
        super(SimpleCassandraExample, self).__init__(name, init_config, agent_config)

    @staticmethod
    def _get_config(instance):
        user = instance.get('user')
        password = instance.get('password')
        service = instance.get('service')
        timeout = int(instance.get('timeout'))

        return user, password, service, timeout

    def check(self, instance):
        user, password, service, timeout = self._get_config(instance)

        dimensions = self._set_dimensions({'component': 'cassandra', 'service': service}, instance)

        results, connection_status = self._query_database(user, password, timeout, CASSANDRA_VERSION_QUERY)

        if connection_status != 0:
            self.gauge('cassandra.connection_status', 1, dimensions=dimensions)
        else:
            # successful connection status
            self.gauge('cassandra.connection_status', 0, dimensions=dimensions)

    def _query_database(self, user, password, timeout, query):
        stdout, stderr, return_code = timeout_command(["/opt/cassandra/bin/vsql", "-U", user, "-w", password, "-A", "-R",
                                                       "|", "-t", "-F", ",", "-x"], timeout, command_input=query)
        if return_code == 0:
            # remove trailing newline
            stdout = stdout.rstrip()
            return stdout, 0
        else:
            self.log.error("Error querying cassandra with return code of {0} and error {1}".format(return_code, stderr))
            return stderr, 1

Installing check plugin

The check plugin needs to have the same file permissions as the detection plugin. File permissions must be read+write for the root user (the user that should own the file), and read for the root group and all other users.

Check plugins should be placed in the following directory.

/usr/lib/monasca/agent/custom_checks.d/

The check plugin directory should be accessed using the monasca_agent_check_plugin_dir Ansible variable. This variable is defined in the roles/monasca-agent/vars/main.yml file.

monasca_agent_check_plugin_dir: /usr/lib/monasca/agent/custom_checks.d/

4.3.4 Configuring Check Plugins

Manually configure a plugin when unit-testing using the monasca-setup script installed with the monasca-agent

Find a good explanation of configuring plugins here: https://github.com/openstack/monasca-agent/blob/master/docs/Agent.md#configuring

SSH to a node that has both the monasca-agent installed as well as the component you wish to monitor.

The following is an example command that configures a plugin that has no parameters (uses the detection plugin class name).

root # /usr/bin/monasca-setup -d ARDANACeilometer

The following is an example command that configures the apache plugin and includes related parameters.

root # /usr/bin/monasca-setup -d apache -a 'url=http://192.168.245.3:9095/server-status?auto'

If there is a change in the configuration it will restart the monasca-agent on the host so the configuration is loaded.
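
The value passed with -a is a string of space-separated key=value pairs. The following sketch shows how such a string can be split into a dictionary. It illustrates the argument format only and is not the parser used by monasca-setup; the function name parse_detection_args is hypothetical.

```python
def parse_detection_args(arg_string):
    """Split 'key1=value1 key2=value2' into a dictionary.

    Only the first '=' in each token separates key from value,
    so values such as URLs with query strings stay intact.
    """
    parsed = {}
    for token in arg_string.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % token)
        parsed[key] = value
    return parsed
```

Because tokens are split on whitespace, this format cannot carry a value that itself contains a space.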

After the plugin is configured, you can verify that the configuration file has your changes (see the section Verify that your check plugin is configured below).

Use the monasca CLI to see if your metric exists (see the Verify that metrics exist section).

Using Ansible modules to configure plugins in SUSE OpenStack Cloud 9

The monasca_agent_plugin module is installed as part of the monasca-agent role.

The following Ansible example configures the process.py plugin using the ceilometer detection plugin. It passes in only the name of the detection class.

- name: _CEI-CMN | monasca_configure |
    Run monasca agent Cloud Lifecycle Manager specific ceilometer detection plugin
  become: yes
  monasca_agent_plugin:
    name: "ARDANACeilometer"

If a password or other sensitive data are passed to the detection plugin, the no_log option should be set to True. If the no_log option is not set to True, the data passed to the plugin will be logged to syslog.

The following Ansible example configures the Cassandra plugin and passes in related arguments.

 - name: Run monasca Agent detection plugin for Cassandra
   monasca_agent_plugin:
     name: "Cassandra"
     args: "directory_names={{ FND_CDB.vars.cassandra_data_dir }},{{ FND_CDB.vars.cassandra_commit_log_dir }} process_username={{ FND_CDB.vars.cassandra_user }}"
   when: database_type == 'cassandra'

The following Ansible example configures the keystone endpoint using the http_check.py detection plugin. The name parameter is httpcheck, the class name of the http_check.py detection plugin.

- name: keystone-monitor | local_monitor |
    Setup active check on keystone internal endpoint locally
  become: yes
  monasca_agent_plugin:
    name: "httpcheck"
    args: "use_keystone=False \
           url=http://{{ keystone_internal_listen_ip }}:{{
               keystone_internal_port }}/v3 \
           dimensions=service:identity-service,\
                       component:keystone-api,\
                       api_endpoint:internal,\
                       monitored_host_type:instance"
  tags:
    - keystone
    - keystone_monitor

Verify that your check plugin is configured

All check configuration files are located in the following directory. You can see the plugins that are running by looking at the plugin configuration directory.

/etc/monasca/agent/conf.d/

When the monasca-agent starts up, all of the check plugins that have a matching configuration file in the /etc/monasca/agent/conf.d/ directory will be loaded.

If there are errors running the check plugin they will be written to the following error log file.

/var/log/monasca/agent/collector.log

You can change the monasca-agent log level by modifying the log_level option in the /etc/monasca/agent/agent.yaml configuration file, and then restarting the monasca-agent, using the following command.

root # service openstack-monasca-agent restart

You can debug a check plugin by running monasca-collector with the check option. The following is an example of the monasca-collector command.

tux > sudo /usr/bin/monasca-collector check CHECK_NAME

Verify that metrics exist

Begin by logging in to your deployer or controller node.

Run the following set of commands, including the monasca metric-list command. If the metric exists, it will be displayed in the output.

ardana > source ~/service.osrc
ardana > monasca metric-list --name METRIC_NAME

4.3.5 Metric Performance Considerations

Collecting metrics on your virtual machines can greatly affect performance. SUSE OpenStack Cloud 9 supports 200 compute nodes, with up to 40 VMs each. If your environment is managing the maximum number of VMs, adding a single metric for all VMs is the equivalent of adding 8000 metrics.
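
The arithmetic behind that figure is shown below as a short sketch, so that the same estimate can be repeated for other cluster sizes or for more than one new metric per VM. The function name added_metric_count is hypothetical.

```python
def added_metric_count(compute_nodes, vms_per_node, new_metrics_per_vm=1):
    """Estimate how many metrics are added cloud-wide by per-VM metrics."""
    return compute_nodes * vms_per_node * new_metrics_per_vm


# 200 compute nodes with 40 VMs each and one new per-VM metric.
MAX_SINGLE_METRIC = added_metric_count(200, 40)
```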

Because of the potential impact that new metrics have on system performance, consider adding only new metrics that are useful for alarm-definition, capacity planning, or debugging process failure.

4.3.6 Configuring Alarm Definitions

The monasca-api-spec, found here https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md provides an explanation of Alarm Definitions and Alarms. You can find more information on alarm definition expressions at the following page: https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md#alarm-definition-expressions.

When an alarm definition is defined, the monasca-threshold engine will generate an alarm for each unique instance of the match_by metric dimensions found in the metric. This allows a single alarm definition that can dynamically handle the addition of new hosts.
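
The effect of match_by can be pictured as grouping incoming metrics by the values of the listed dimensions, with one alarm per distinct group. The sketch below is a simplified illustration of that grouping, not the monasca-threshold implementation. It assumes that every metric carries every match_by dimension, and the function name alarm_keys is hypothetical.

```python
def alarm_keys(metric_dimensions, match_by):
    """Return one key per unique combination of match_by dimension values.

    metric_dimensions is a list of dictionaries, one per observed metric.
    """
    keys = set()
    for dims in metric_dimensions:
        keys.add(tuple(dims[name] for name in match_by))
    return sorted(keys)


observed = [
    {"hostname": "host1", "process_name": "nova-api"},
    {"hostname": "host2", "process_name": "nova-api"},
    {"hostname": "host1", "process_name": "nova-api"},
]
```

With match_by set to hostname, the three observed metrics above fall into two groups, one per host. A metric arriving from a third host would add a third group without any change to the definition.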

There are default alarm definitions configured for all "process check" (process.py check) and "HTTP Status" (http_check.py check) metrics in the monasca-default-alarms role. The monasca-default-alarms role is installed as part of the monasca deployment phase of your cloud's deployment. You do not need to create alarm definitions for these existing checks.

Third parties should create an alarm definition when they wish to alarm on a custom plugin metric. The alarm definition should only be defined once. Setting a notification method for the alarm definition is recommended but not required.

The following Ansible modules used for alarm definitions are installed as part of the monasca-alarm-definition role. This process takes place during the monasca set up phase of your cloud's deployment.

  • monasca_alarm_definition

  • monasca_notification_method

The following examples, found in the ~/openstack/ardana/ansible/roles/monasca-default-alarms directory, illustrate how monasca sets up the default alarm definitions.

monasca Notification Methods

The monasca-api-spec, found in the following link, provides details about creating a notification https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md#create-notification-method

The following are supported notification types.

  • EMAIL

  • WEBHOOK

  • PAGERDUTY

The keystone_admin_tenant project is used so that the alarms will show up on the Operations Console UI.

The following file snippet shows variables from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/defaults/main.yml file.

---
notification_address: root@localhost
notification_name: 'Default Email'
notification_type: EMAIL

monasca_keystone_url: "{{ KEY_API.advertises.vips.private[0].url }}/v3"
monasca_api_url: "{{ MON_AGN.consumes_MON_API.vips.private[0].url }}/v2.0"
monasca_keystone_user: "{{ MON_API.consumes_KEY_API.vars.keystone_monasca_user }}"
monasca_keystone_password: "{{ MON_API.consumes_KEY_API.vars.keystone_monasca_password | quote }}"
monasca_keystone_project: "{{ KEY_API.vars.keystone_admin_tenant }}"

monasca_client_retries: 3
monasca_client_retry_delay: 2

You can specify a single default notification method in the ~/openstack/ardana/ansible/roles/monasca-default-alarms/tasks/main.yml file. You can also add or modify the notification type and related details using the Operations Console UI or monasca CLI.

The following is a code snippet from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/tasks/main.yml file.

---
- name: monasca-default-alarms | main | Setup default notification method
  monasca_notification_method:
    name: "{{ notification_name }}"
    type: "{{ notification_type }}"
    address: "{{ notification_address }}"
    keystone_url: "{{ monasca_keystone_url }}"
    keystone_user: "{{ monasca_keystone_user }}"
    keystone_password: "{{ monasca_keystone_password }}"
    keystone_project: "{{ monasca_keystone_project }}"
    monasca_api_url: "{{ monasca_api_url }}"
  no_log: True
  tags:
    - system_alarms
    - monasca_alarms
    - openstack_alarms
  register: default_notification_result
  until: not default_notification_result | failed
  retries: "{{ monasca_client_retries }}"
  delay: "{{ monasca_client_retry_delay }}"

monasca Alarm Definition

In the alarm definition "expression" field, you can specify the metric name and threshold. The "match_by" field is used to create a new alarm for every unique combination of the match_by metric dimensions.

Find more details on alarm definitions in the monasca API documentation: https://github.com/stackforge/monasca-api/blob/master/docs/monasca-api-spec.md#alarm-definitions-and-alarms
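
As an illustration of how an expression such as avg(cpu.idle_perc) < 10 times 3 can be read, the following Python sketch checks whether the most recent three period averages all fall below the threshold. It is a simplified model under stated assumptions, not the monasca-threshold implementation; the function name breaches_for_periods and the sample values are hypothetical.

```python
def breaches_for_periods(period_averages, threshold, periods):
    """Return True if the last `periods` averages are all below `threshold`.

    Simplified reading of an expression like "avg(metric) < 10 times 3":
    the comparison must hold for 3 consecutive evaluation periods.
    """
    if len(period_averages) < periods:
        return False  # not enough data yet to satisfy "times N"
    recent = period_averages[-periods:]
    return all(value < threshold for value in recent)


# Hypothetical per-period averages of cpu.idle_perc for one host.
idle_history = [55.0, 8.0, 6.5, 4.0]
print(breaches_for_periods(idle_history, threshold=10, periods=3))
```

A single low reading surrounded by healthy values would not satisfy the condition in this model, which is the reason for using "times" with noisy metrics.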

The following is a code snippet from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/tasks/main.yml file.

- name: monasca-default-alarms | main | Create Alarm Definitions
  monasca_alarm_definition:
    name: "{{ item.name }}"
    description: "{{ item.description | default('') }}"
    expression: "{{ item.expression }}"
    keystone_token: "{{ default_notification_result.keystone_token }}"
    match_by: "{{ item.match_by | default(['hostname']) }}"
    monasca_api_url: "{{ default_notification_result.monasca_api_url }}"
    severity: "{{ item.severity | default('LOW') }}"
    alarm_actions:
      - "{{ default_notification_result.notification_method_id }}"
    ok_actions:
      - "{{ default_notification_result.notification_method_id }}"
    undetermined_actions:
      - "{{ default_notification_result.notification_method_id }}"
  register: monasca_system_alarms_result
  until: not monasca_system_alarms_result | failed
  retries: "{{ monasca_client_retries }}"
  delay: "{{ monasca_client_retry_delay }}"
  with_flattened:
    - monasca_alarm_definitions_system
    - monasca_alarm_definitions_monasca
    - monasca_alarm_definitions_openstack
    - monasca_alarm_definitions_misc_services
  when: monasca_create_definitions

In the following example from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/vars/main.yml Ansible variables file, the alarm definition named Process Check sets match_by to the following dimensions.

  • process_name

  • hostname

monasca_alarm_definitions_system:
  - name: "Host Status"
    description: "Alarms when the specified host is down or not reachable"
    severity: "HIGH"
    expression: "host_alive_status > 0"
    match_by:
      - "target_host"
      - "hostname"
  - name: "HTTP Status"
    description: >
      "Alarms when the specified HTTP endpoint is down or not reachable"
    severity: "HIGH"
    expression: "http_status > 0"
    match_by:
      - "service"
      - "component"
      - "hostname"
      - "url"
  - name: "CPU Usage"
    description: "Alarms when CPU usage is high"
    expression: "avg(cpu.idle_perc) < 10 times 3"
  - name: "High CPU IOWait"
    description: "Alarms when CPU IOWait is high, possible slow disk issue"
    expression: "avg(cpu.wait_perc) > 40 times 3"
    match_by:
      - "hostname"
  - name: "Disk Inode Usage"
    description: "Alarms when disk inode usage is high"
    expression: "disk.inode_used_perc > 90"
    match_by:
      - "hostname"
      - "device"
    severity: "HIGH"
  - name: "Disk Usage"
    description: "Alarms when disk usage is high"
    expression: "disk.space_used_perc > 90"
    match_by:
      - "hostname"
      - "device"
    severity: "HIGH"
  - name: "Memory Usage"
    description: "Alarms when memory usage is high"
    severity: "HIGH"
    expression: "avg(mem.usable_perc) < 10 times 3"
  - name: "Network Errors"
    description: >
      "Alarms when either incoming or outgoing network errors are high"
    severity: "MEDIUM"
    expression: "net.in_errors_sec > 5 or net.out_errors_sec > 5"
  - name: "Process Check"
    description: "Alarms when the specified process is not running"
    severity: "HIGH"
    expression: "process.pid_count < 1"
    match_by:
      - "process_name"
      - "hostname"
  - name: "Crash Dump Count"
    description: "Alarms when a crash directory is found"
    severity: "MEDIUM"
    expression: "crash.dump_count > 0"
    match_by:
      - "hostname"

The preceding configuration would result in the creation of an alarm for each unique metric that matched the following criteria.

process.pid_count + process_name + hostname
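
The effect of match_by can be sketched in Python as grouping metrics by the listed dimensions: one alarm is expected per distinct combination. This is an illustration only; the function name alarm_keys, the metric dictionaries, and the host and process names are hypothetical and do not reflect how monasca-threshold stores its data.

```python
def alarm_keys(metrics, metric_name, match_by):
    """Return the set of unique match_by dimension combinations.

    Each metric is a dict with a "name" and a "dimensions" dict. One alarm
    would be expected per distinct combination of the match_by values.
    """
    keys = set()
    for metric in metrics:
        if metric["name"] != metric_name:
            continue
        dims = metric["dimensions"]
        keys.add(tuple(dims.get(dim) for dim in match_by))
    return keys


# Hypothetical metrics reported by two hosts.
metrics = [
    {"name": "process.pid_count",
     "dimensions": {"process_name": "nova-api", "hostname": "ctrl1"}},
    {"name": "process.pid_count",
     "dimensions": {"process_name": "nova-api", "hostname": "ctrl2"}},
    {"name": "process.pid_count",
     "dimensions": {"process_name": "keystone", "hostname": "ctrl1"}},
    {"name": "process.pid_count",
     "dimensions": {"process_name": "nova-api", "hostname": "ctrl1"}},
    {"name": "cpu.idle_perc", "dimensions": {"hostname": "ctrl1"}},
]

keys = alarm_keys(metrics, "process.pid_count", ["process_name", "hostname"])
for key in sorted(keys):
    print(key)
```

In this sample, four process.pid_count measurements collapse into three alarms because two of them share the same process_name and hostname.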

Check that the alarms exist

Begin by using the following commands, including monasca alarm-definition-list, to check that the alarm definition exists.

ardana > source ~/service.osrc
ardana > monasca alarm-definition-list --name ALARM_DEFINITION_NAME

Then use either of the following commands to check that the alarm has been generated. A status of "OK" indicates a healthy alarm.

ardana > monasca alarm-list --metric-name METRIC_NAME

Or

ardana > monasca alarm-list --alarm-definition-id ID_FROM_ALARM-DEFINITION-LIST
Note

To see CLI options use the monasca help command.

Alarm state upgrade considerations

If the name of a monitoring metric changes or is no longer being sent, existing alarms will show the alarm state as UNDETERMINED. You can update an alarm definition as long as you do not change the metric name or dimension name values in the expression or match_by fields. If you find that you need to alter either of these values, you must delete the old alarm definitions and create new definitions with the updated values.

If a metric is never sent, but has a related alarm definition, then no alarms would exist. If you find that metrics are never sent, then you should remove the related alarm definitions.

To remove an alarm definition, use the monasca_alarm_definition Ansible module with the state set to absent.

The following file snippet shows an example of how to remove an alarm definition by setting the state to absent.

- name: monasca-pre-upgrade | Remove alarm definitions
  monasca_alarm_definition:
    name: "{{ item.name }}"
    state: "absent"
    keystone_url: "{{ monasca_keystone_url }}"
    keystone_user: "{{ monasca_keystone_user }}"
    keystone_password: "{{ monasca_keystone_password }}"
    keystone_project: "{{ monasca_keystone_project }}"
    monasca_api_url: "{{ monasca_api_url }}"
  with_items:
    - { name: "Kafka Consumer Lag" }

An alarm exists in the OK state when the monasca threshold engine has seen at least one metric associated with the alarm definition and that metric has not exceeded the alarm definition threshold.
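
The alarm states described in this section can be summarized with a small Python sketch. It assumes a definition of the form "metric > threshold" and is a simplified model rather than the monasca-threshold logic; the function name alarm_state and its arguments are hypothetical.

```python
def alarm_state(values, threshold):
    """Return a simplified alarm state for a "metric > threshold" definition.

    values: measurements seen for the metric in the current evaluation window,
            or None if the metric has never been reported.
    """
    if values is None:
        return None  # metric never sent: no alarm exists at all
    if not values:
        return "UNDETERMINED"  # metric was seen before but is no longer sent
    if any(value > threshold for value in values):
        return "ALARM"
    return "OK"


print(alarm_state(None, 90))          # no alarm has been created
print(alarm_state([], 90))            # metric stopped arriving
print(alarm_state([42.0, 57.5], 90))  # within the threshold
print(alarm_state([42.0, 95.0], 90))  # threshold exceeded
```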

4.3.7 OpenStack Integration of Custom Plugins into monasca-agent (if applicable)

monasca-agent is an OpenStack open-source project. monasca can also monitor non-OpenStack services. Third parties should install custom plugins into their SUSE OpenStack Cloud 9 system using the steps outlined in Section 4.3.3, “Writing Custom Plugins”. If the OpenStack community determines that a custom plugin is of general benefit, the plugin may be added to openstack/monasca-agent so that it is installed with the monasca-agent. During the review process for openstack/monasca-agent, there are no guarantees that code will be approved or merged by a deadline. Open-source contributors are expected to help with code reviews in order to get their code accepted. Once changes are approved and integrated into openstack/monasca-agent and that version of the monasca-agent is integrated with SUSE OpenStack Cloud 9, the third party can remove the custom plugin installation steps, since the plugin would be installed in the default monasca-agent venv.

Find the open source repository for the monasca-agent here: https://github.com/openstack/monasca-agent

5 Managing Identity

The Identity service provides the structure for user authentication to your cloud.

5.1 The Identity Service

This topic explains the purpose and mechanisms of the identity service.

The SUSE OpenStack Cloud Identity service, based on the OpenStack keystone API, is responsible for providing UserID authentication and access authorization to enable organizations to achieve their access security and compliance objectives and successfully deploy OpenStack. In short, the Identity service is the gateway to the rest of the OpenStack services.

5.1.1 Which version of the Identity service should you use?

Use Identity API version 3.0, as previous versions no longer exist as endpoints for Identity API queries.

Similarly, when performing queries, you must use the OpenStack CLI (the openstack command), and not the keystone CLI (keystone) as the latter is only compatible with API versions prior to 3.0.
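
For readers who call the Identity API directly, the following Python sketch builds the request body for password authentication against the version 3 token endpoint (POST /v3/auth/tokens). The function name build_v3_password_auth and all user, password, domain, and project values are hypothetical placeholders; the sketch only constructs the JSON and does not contact keystone.

```python
import json


def build_v3_password_auth(username, password, user_domain, project, project_domain):
    """Build the JSON body for POST /v3/auth/tokens (password method, project scope)."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"name": user_domain},
                        "password": password,
                    }
                },
            },
            "scope": {
                "project": {
                    "name": project,
                    "domain": {"name": project_domain},
                }
            },
        }
    }


# Placeholder credentials for illustration only.
body = build_v3_password_auth("demo", "secret", "Default", "demo", "Default")
print(json.dumps(body, indent=2))
```

Unlike the earlier API versions, the version 3 body names a domain for both the user and the project, which is why the domain concepts described later in this chapter matter even for simple requests.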

5.1.2 Authentication

The authentication function provides the initial login function to OpenStack. keystone supports multiple sources of authentication, including a native or built-in authentication system. The keystone native system can be used for all user management functions for proof of concept deployments or small deployments not requiring integration with a corporate authentication system, but it lacks some of the advanced functions usually found in user management systems such as forcing password changes. The focus of the keystone native authentication system is to be the source of authentication for OpenStack-specific users required for the operation of the various OpenStack services. These users are stored by keystone in a default domain; the addition of these IDs to an external authentication system is not required.

keystone is more commonly integrated with external authentication systems such as OpenLDAP or Microsoft Active Directory. These systems are usually centrally deployed by organizations to serve as the single source of user management and authentication for all in-house deployed applications and systems requiring user authentication. In addition to LDAP and Microsoft Active Directory, support for integration with Security Assertion Markup Language (SAML)-based identity providers from companies such as Ping, CA, IBM, Oracle, and others is also nearly "production-ready".

keystone also provides architectural support via the underlying Apache deployment for other types of authentication systems such as Multi-Factor Authentication. These types of systems typically require driver support and integration from the respective provider vendors.

Note

While support for Identity Providers and Multi-factor authentication is available in keystone, it has not yet been certified by the SUSE OpenStack Cloud engineering team and is an experimental feature in SUSE OpenStack Cloud.

LDAP-compatible directories such as OpenLDAP and Microsoft Active Directory are recommended alternatives to using the keystone local authentication. Both methods are widely used by organizations and are integrated with a variety of other enterprise applications. These directories act as the single source of user information within an organization. keystone can be configured to authenticate against an LDAP-compatible directory on a per-domain basis.

Domains, as explained in Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”, can be configured so that, based on the user ID, an incoming user is automatically mapped to a specific domain. This domain can then be configured to authenticate against a specific LDAP directory. The user credentials provided by the user to keystone are passed along to the designated LDAP source for authentication. This communication can optionally be secured via SSL encryption. No special LDAP administrative access is required, and only read-only access is needed for this configuration. keystone will not add any LDAP information. All user additions, deletions, and modifications are performed by the application's front end in the LDAP directories. After a user has been successfully authenticated, they are then assigned to the groups, roles, and projects defined by the keystone domain or project administrators. This information is stored within the keystone service database.

Another form of external authentication provided by the keystone service is via integration with SAML-based Identity Providers (IdP) such as Ping Identity, IBM Tivoli, and Microsoft Active Directory Federation Server. A SAML-based identity provider provides authentication that is often called "single sign-on". The IdP server is configured to authenticate against identity sources such as Active Directory and provides a single authentication API against multiple types of downstream identity sources. This means that an organization could have multiple identity storage sources but a single authentication source. In addition, if a user has logged into one such source during a defined session time frame, they do not need to re-authenticate within the defined session. Instead, the IdP will automatically validate the user to requesting applications and services.

A SAML-based IdP authentication source is configured with keystone on a per-domain basis similar to the manner in which native LDAP directories are configured. Extra mapping rules are required in the configuration that define which keystone group an incoming UID is automatically assigned to. This means that groups need to be defined in keystone first, but it also removes the requirement that a domain or project admin assign user roles and project membership on a per-user basis. Instead, groups are used to define project membership and roles and incoming users are automatically mapped to keystone groups based on their upstream group membership. This provides a very consistent role-based access control (RBAC) model based on the upstream identity source. The configuration of this option is fairly straightforward. IdP vendors such as Ping and IBM are contributing to the maintenance of this function and have also produced their own integration documentation. Microsoft Active Directory Federation Services (ADFS) is used for functional testing and future documentation.
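
The idea behind these mapping rules can be sketched in Python as a lookup from upstream group names to keystone group names. This is a deliberately simplified illustration and does not use keystone's actual mapping rule format; the function name map_to_keystone_groups and all group names are hypothetical.

```python
def map_to_keystone_groups(upstream_groups, mapping_rules):
    """Return the keystone groups for a user, based on upstream group membership.

    mapping_rules: dict mapping an upstream (IdP) group name to a keystone
    group name. Upstream groups without a rule are ignored.
    """
    return sorted({mapping_rules[g] for g in upstream_groups if g in mapping_rules})


# Hypothetical rules: the keystone groups must already exist in keystone.
rules = {"acme-it-staff": "it-admins", "acme-marketing": "marketing-users"}

print(map_to_keystone_groups(["acme-it-staff", "acme-all"], rules))
```

Because roles and project membership are attached to the keystone groups, no per-user assignment is needed once the rules are in place.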

In addition to SAML-based IdP, keystone also supports external authentication with a third-party IdP using the OpenID Connect protocol by leveraging the capabilities provided by the Apache2 mod_auth_openidc module. The configuration of OpenID Connect is similar to SAML.

The third keystone-supported authentication source is known as Multi-Factor Authentication (MFA). MFA typically requires an external source of authentication beyond a login name and password, and can include options such as SMS text, a temporal token generator, or a fingerprint scanner. Each of these types of MFA is usually specific to a particular MFA vendor. The keystone architecture supports an MFA-based authentication system, but this has not yet been certified or documented for SUSE OpenStack Cloud.

5.1.3 Authorization

The second major function provided by the keystone service is access authorization that determines what resources and actions are available based on the UserID, the role of the user, and the projects that a user is provided access to. All of this information is created, managed, and stored by keystone. These functions are applied via the horizon web interface, the OpenStack command-line interface, or the direct keystone API.

keystone provides support for organizing users via three entities:

Domains

Domains provide the highest level of organization. Domains are intended to be used as high-level containers for multiple projects. A domain can represent different tenants, companies or organizations for an OpenStack cloud deployed for public cloud deployments or represent major business units, functions, or any other type of top-level organization unit in an OpenStack private cloud deployment. Each domain has at least one Domain Admin assigned to it. This Domain Admin can then create multiple projects within the domain and assign the project admin role to specific project owners. Each domain created in an OpenStack deployment is unique and the projects assigned to a domain cannot exist in another domain.

Projects

Projects are entities within a domain that represent groups of users, each user role within that project, and how many underlying infrastructure resources can be consumed by members of the project.

Groups

Groups are an optional function and provide the means of assigning project roles to multiple users at once.

keystone also provides the means to create and assign roles to groups of users or individual users. The role names are created and user assignments are made within keystone. The actual function of a role is currently defined by each OpenStack service via scripts. When a user requests access to an OpenStack service, their access token contains information about their assigned project membership and role for that project. This role is then matched to the service-specific script and the user is allowed to perform functions within that service defined by the role mapping.

5.2 Supported Upstream Keystone Features

5.2.1 OpenStack upstream features that are enabled by default in SUSE OpenStack Cloud 9

The following supported keystone features are enabled by default in the SUSE OpenStack Cloud 9 release.

Name | User/Admin | Note: API support only. No CLI/UI support
Implied Roles | Admin | https://blueprints.launchpad.net/keystone/+spec/implied-roles
Domain-Specific Roles | Admin | https://blueprints.launchpad.net/keystone/+spec/domain-specific-roles
Fernet Token Provider | User and Admin | https://docs.openstack.org/keystone/rocky/admin/identity-fernet-token-faq.html

Implied roles

To allow for the practice of hierarchical permissions in user roles, this feature enables roles to be linked in such a way that they function as a hierarchy with role inheritance.

When a user is assigned a superior role, the user will also be assigned all roles implied by any subordinate roles. The hierarchy of the assigned roles will be expanded when issuing the user a token.
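
The expansion of a role hierarchy can be sketched in Python as a transitive walk over the implied roles. This is a minimal illustration, not keystone's implementation; the function name expand_roles and the role names in the sample hierarchy are hypothetical.

```python
def expand_roles(assigned, implies):
    """Return all roles a user effectively holds, following implied roles transitively.

    implies: dict mapping a role name to the list of roles it implies.
    """
    effective = set()
    stack = list(assigned)
    while stack:
        role = stack.pop()
        if role in effective:
            continue  # already expanded; also guards against cycles
        effective.add(role)
        stack.extend(implies.get(role, []))
    return effective


# Hypothetical hierarchy: admin implies member, member implies reader.
implies = {"admin": ["member"], "member": ["reader"]}

print(sorted(expand_roles(["admin"], implies)))
print(sorted(expand_roles(["member"], implies)))
```

A user assigned only the superior role ends up with every subordinate role as well, which mirrors the expansion that takes place when a token is issued.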

Domain-specific roles

This feature extends the principle of implied roles to include a set of roles that are specific to a domain. At the time a token is issued, the domain-specific roles are not included in the token; however, the roles that they map to are.

Fernet token provider

Provides tokens in the Fernet format. This feature is automatically configured and is enabled by default. Fernet tokens are preferred and used by default instead of the older UUID token format.

5.2.2 OpenStack upstream features that are disabled by default in SUSE OpenStack Cloud 9

The following is a list of features which are fully supported in the SUSE OpenStack Cloud 9 release, but are disabled by default. Customers can run a playbook to enable the features.

Name | User/Admin | Reason Disabled
Support multiple LDAP backends via per-domain configuration | Admin | Needs explicit configuration.
WebSSO | User and Admin | Needs explicit configuration.
keystone-to-keystone (K2K) federation | User and Admin | Needs explicit configuration.
Domain-specific config in SQL | Admin | Domain-specific configuration options can be stored in SQL instead of configuration files, using the new REST APIs.

Multiple LDAP backends for each domain

This feature allows identity backends to be configured on a domain-by-domain basis. Domains will be capable of having their own exclusive LDAP service (or multiple services). A single LDAP service can also serve multiple domains, with each domain in a separate subtree.

To implement this feature, individual domains will require domain-specific configuration files. Domains that do not implement this feature will continue to share a common backend driver.

WebSSO

This feature enables the keystone service to provide federated identity services through a token-based single sign-on page. This feature is disabled by default, as it requires explicit configuration.

keystone-to-keystone (K2K) federation

This feature enables separate keystone instances to federate identities among the instances, offering inter-cloud authorization. This feature is disabled by default, as it requires explicit configuration.

Domain-specific config in SQL

Using the new REST APIs, domain-specific configuration options can be stored in a SQL database instead of in configuration files.

5.2.3 OpenStack upstream features that have been specifically disabled in SUSE OpenStack Cloud 9

The following is a list of extensions which are disabled by default in SUSE OpenStack Cloud 9, according to keystone policy.

Target Release | Name | User/Admin | Reason Disabled
TBD | Endpoint Filtering | Admin | This extension was implemented to facilitate service activation. However, due to lack of enforcement at the service side, this feature is only partially effective right now.
TBD | Endpoint Policy | Admin | This extension was intended to facilitate policy (policy.json) management and enforcement. This feature is not usable right now because the middleware needed to utilize the policy files stored in keystone is missing.
TBD | OAuth 1.0a | User and Admin | Complexity in workflow. Lack of adoption. Its alternative, keystone Trust, is enabled by default. heat is using keystone Trust.
TBD | Revocation Events | Admin | For PKI token only, and PKI token is disabled by default due to usability concerns.
TBD | OS CERT | Admin | For PKI token only, and PKI token is disabled by default due to usability concerns.
TBD | PKI Token | Admin | PKI token is disabled by default due to usability concerns.
TBD | Driver level caching | Admin | Driver level caching is disabled by default due to complexity in setup.
TBD | Tokenless Authz | Admin | Tokenless authorization with X.509 SSL client certificate.
TBD | TOTP Authentication | User | Not fully baked. Has not been battle-tested.
TBD | is_admin_project | Admin | No integration with the services.

5.3 Understanding Domains, Projects, Users, Groups, and Roles

This section describes the concepts that the identity service uses for authentication within your cloud.

The SUSE OpenStack Cloud 9 identity service uses OpenStack keystone and the concepts of domains, projects, users, groups, and roles to manage authentication. This page describes how these work together.

5.3.1 Domains, Projects, Users, Groups, and Roles

Most large business organizations use an identity system such as Microsoft Active Directory to store and manage their internal user information. A variety of applications such as HR systems are, in turn, used to manage the data inside of Active Directory. These same organizations often deploy a separate user management system for external users such as contractors, partners, and customers. Multiple authentication systems are then deployed to support multiple types of users.

An LDAP-compatible directory such as Active Directory provides a top-level organization or domain component. In this example, the organization is called Acme. The domain component (DC) is defined as acme.com. Underneath the top level domain component are entities referred to as organizational units (OU). Organizational units are typically designed to reflect the entity structure of the organization. For example, this particular schema has 3 different organizational units for the Marketing, IT, and Contractors units or departments of the Acme organization. Users (and other types of entities like printers) are then defined appropriately underneath each organizational entity. The keystone domain entity can be used to match the LDAP OU entity; each LDAP OU can have a corresponding keystone domain created. In this example, both the Marketing and IT domains represent internal employees of Acme and use the same authentication source. The Contractors domain contains all external people associated with Acme. UserIDs associated with the Contractor domain are maintained in a separate user directory and thus have a different authentication source assigned to the corresponding keystone-defined Contractors domain.

A public cloud deployment usually supports multiple, separate organizations. keystone domains can be created to provide a domain per organization with each domain configured to the underlying organization's authentication source. For example, the ABC company would have a keystone domain created called "abc". All users authenticating to the "abc" domain would be authenticated against the authentication system provided by the ABC organization; in this case ldap://ad.abc.com

5.3.2 Domains

A domain is a top-level container targeted at defining major organizational entities.

  • Domains can be used in a multi-tenant OpenStack deployment to segregate projects and users from different companies in a public cloud deployment or different organizational units in a private cloud setting.

  • Domains provide the means to identify multiple authentication sources.

  • Each domain is unique within an OpenStack implementation.

  • Multiple projects can be assigned to a domain but each project can only belong to a single domain.

  • Each domain has an assigned "admin".

  • Each project has an assigned "admin".

  • Domains are created by the "admin" service account and domain admins are assigned by the "admin" user.

  • The "admin" UserID (UID) is created during the keystone installation, has the "admin" role assigned to it, and is defined as the "Cloud Admin". This UID is created using the "magic" or "secret" admin token found in the default 'keystone.conf' file installed during the SUSE OpenStack Cloud keystone installation. This secret token should be removed after installation and the "admin" password changed.

  • The "default" domain is created automatically during the SUSE OpenStack Cloud keystone installation.

  • The "default" domain contains all OpenStack service accounts that are installed during the SUSE OpenStack Cloud keystone installation process.

  • No users but the OpenStack service accounts should be assigned to the "default" domain.

  • Domain admins can be any UserID inside or outside of the domain.

5.3.3 Domain Administrator

A UID is a domain administrator for a given domain if that UID has a domain-scoped token scoped for the given domain. This means that the UID has the "admin" role assigned to it for the selected domain.

  • The Cloud Admin UID assigns the domain administrator role for a domain to a selected UID.

  • A domain administrator can create and delete local users who have authenticated against keystone. These users will be assigned to the domain belonging to the domain administrator who creates the UserID.

  • A domain administrator can only create users and projects within their assigned domains.

  • A domain administrator can assign the "admin" role of their domains to another UID or revoke it; each UID with the "admin" role for a specified domain will be a co-administrator for that domain.

  • A UID can be assigned to be the domain admin of multiple domains.

  • A domain administrator can assign non-admin roles to any users and groups within their assigned domain, including projects owned by their assigned domain.

  • A domain admin UID can belong to projects within their administered domains.

  • Each domain can have a different authentication source.

  • The domain field is used during the initial login to define the source of authentication.

  • The "List Users" function can only be executed by a UID with the domain admin role.

  • A domain administrator can assign a UID from outside of their domain the "domain admin" role, but it is assumed that the domain admin would know the specific UID and would not need to list users from an external domain.

  • A domain administrator can assign a UID from outside of their domain the "project admin" role for a specific project within their domain, but it is assumed that the domain admin would know the specific UID and would not need to list users from an external domain.

  • Any user that needs the ability to create a user in a project should be granted the "admin" role for the domain where the user and the project reside.

  • In order for the horizon Compute › Images panel to properly fill the "Owner" column, any user that is granted the admin role on a project must also be granted the "member" or "admin" role in the domain.

5.3.4 Projects

The domain administrator creates projects within their assigned domain and assigns the project admin role for each project to a selected UID. A UID is a project administrator for a given project if that UID has a project-scoped token scoped for the given project. There can be multiple projects per domain. The project admin sets the project quota settings, adds/deletes users and groups to and from the project, and defines the user/group roles for the assigned project. Users can belong to multiple projects and have different roles on each project. Users are assigned to a specific domain and a default project. Roles are assigned per project.

5.3.5 Users and Groups

Each user belongs to one domain only. Domain assignments are defined either by the domain configuration files or by a domain administrator when creating a new, local (user authenticated against keystone) user. There is no current method for "moving" a user from one domain to another. A user can belong to multiple projects within a domain with a different role assignment per project. A group is a collection of users. Users can be assigned to groups either by the project admin or automatically via mappings if an external authentication source is defined for the assigned domain. Groups can be assigned to multiple projects within a domain and have different roles assigned to the group per project. A group can be assigned the "admin" role for a domain or project. All members of the group will be an "admin" for the selected domain or project.
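
The way direct and group-based role assignments combine can be sketched in Python. This is an illustration of the concept only, not keystone's data model; the function name effective_project_roles and the user, group, project, and role names are hypothetical.

```python
def effective_project_roles(user, project, user_roles, group_roles, group_members):
    """Return the roles a user holds on a project, directly or through groups.

    user_roles:    {(user, project): [role, ...]}
    group_roles:   {(group, project): [role, ...]}
    group_members: {group: [user, ...]}
    """
    roles = set(user_roles.get((user, project), []))
    for group, members in group_members.items():
        if user in members:
            roles.update(group_roles.get((group, project), []))
    return roles


# Hypothetical assignments within a single domain.
user_roles = {("alice", "web"): ["member"]}
group_roles = {("ops", "web"): ["admin"], ("ops", "db"): ["member"]}
group_members = {"ops": ["alice", "bob"]}

print(sorted(effective_project_roles("alice", "web", user_roles, group_roles, group_members)))
print(sorted(effective_project_roles("bob", "db", user_roles, group_roles, group_members)))
```

In this sample, every member of the ops group becomes an admin of the web project, which illustrates why assigning the "admin" role to a group should be done with care.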

5.3.6 Roles

Service roles represent the functionality used to implement the OpenStack role-based access control (RBAC) model, which is used to manage access to each OpenStack service. Roles are named and assigned per user or group for each project by the identity service. Role definition and policy enforcement are defined outside of the identity service, independently by each OpenStack service. The token generated by the identity service for each user authentication contains the role assigned to that user for a particular project. When a user attempts to access a specific OpenStack service, the role is parsed by the service and compared to the service-specific policy file, and the user is then granted the resource access defined for that role by the service policy file.
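
The comparison between the roles in a token and a service policy can be sketched in Python as follows. Real policy.json files use a richer rule language than this; the dictionary format, the function name is_allowed, and the action and role names are hypothetical simplifications.

```python
def is_allowed(action, token_roles, policy):
    """Return True if any role in the token is permitted to perform the action.

    policy: dict mapping an action name to the set of roles allowed to perform
    it. Actions that are missing from the policy are denied in this sketch.
    """
    allowed_roles = policy.get(action, set())
    return bool(allowed_roles & set(token_roles))


# Hypothetical, simplified policy for a single service.
policy = {
    "server:list": {"admin", "member", "project_observer"},
    "server:delete": {"admin"},
}

print(is_allowed("server:list", ["project_observer"], policy))
print(is_allowed("server:delete", ["member"], policy))
```

Keeping the decision inside each service, as modeled here, is what allows the identity service to name and assign roles without knowing what each role permits.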

Each service has its own service policy file with the /etc/[SERVICE_CODENAME]/policy.json file name format, where [SERVICE_CODENAME] represents a specific OpenStack service name. For example, the OpenStack nova service would have a policy file called /etc/nova/policy.json. Service policy files can be modified and deployed to control nodes from the Cloud Lifecycle Manager. Administrators are advised to validate policy changes before checking them in to the site branch of the local git repository and before rolling them into production. Do not make changes to policy files without having a way to validate them.

The policy files are located at the following site branch locations on the Cloud Lifecycle Manager.

~/openstack/ardana/ansible/roles/GLA-API/templates/policy.json.j2
~/openstack/ardana/ansible/roles/ironic-common/files/policy.json
~/openstack/ardana/ansible/roles/KEYMGR-API/templates/policy.json
~/openstack/ardana/ansible/roles/heat-common/files/policy.json
~/openstack/ardana/ansible/roles/CND-API/templates/policy.json
~/openstack/ardana/ansible/roles/nova-common/files/policy.json
~/openstack/ardana/ansible/roles/CEI-API/templates/policy.json.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/policy.json.j2

For test and validation, policy files can be modified in a non-production environment from the ~/scratch/ directory. For a specific policy file, run a search for policy.json. To deploy policy changes for a service, run the service specific reconfiguration playbook (for example, nova-reconfigure.yml). For a complete list of reconfiguration playbooks, change directories to ~/scratch/ansible/next/ardana/ansible and run this command:

ardana > ls | grep reconfigure

A read-only role named project_observer is explicitly created in SUSE OpenStack Cloud 9. Any user who is granted this role can use list_project.

5.4 Identity Service Token Validation Example

The following diagram illustrates the flow of typical Identity service (keystone) requests/responses between SUSE OpenStack Cloud services and the Identity service. It shows how keystone issues and validates tokens to ensure the identity of the caller of each service.

Image
  1. horizon sends an HTTP authentication request containing the user credentials to keystone.

  2. keystone validates the credentials and replies with a token.

  3. horizon sends a POST request with the token to nova to start provisioning a virtual machine.

  4. nova sends the token to keystone for validation.

  5. keystone validates the token.

  6. nova forwards a request for an image to glance with the attached token.

  7. glance sends the token to keystone for validation.

  8. keystone validates the token.

  9. glance provides image-related information to nova.

  10. nova sends a request for networks to neutron with the token.

  11. neutron sends the token to keystone for validation.

  12. keystone validates the token.

  13. neutron provides network-related information to nova.

  14. nova reports the status of the virtual machine provisioning request.
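The flow above can be walked through with a toy model. This is only an illustrative sketch: the ToyKeystone class, the boot_server function, and the in-memory token table are assumptions made for brevity and do not reflect how keystone or the other services are implemented.

```python
import secrets


class ToyKeystone:
    """Stand-in for the Identity service: issues and validates opaque tokens."""

    def __init__(self, users):
        self._users = users   # user name -> password
        self._tokens = {}     # token -> user name (toy storage only)

    def authenticate(self, user, password):
        # Steps 1-2: check the credentials and reply with a token.
        if self._users.get(user) != password:
            raise PermissionError("invalid credentials")
        token = secrets.token_hex(16)
        self._tokens[token] = user
        return token

    def validate(self, token):
        # Steps 5, 8, and 12: confirm that the token was issued by this service.
        return token in self._tokens


def boot_server(keystone, token):
    """Steps 3-14: each service validates the token before doing its part."""
    steps = []
    for service in ("nova", "glance", "neutron"):
        if not keystone.validate(token):
            raise PermissionError(f"{service}: token rejected")
        steps.append(f"{service}: token accepted")
    steps.append("nova: provisioning request accepted")
    return steps


keystone = ToyKeystone({"demo": "secret"})
token = keystone.authenticate("demo", "secret")
for line in boot_server(keystone, token):
    print(line)
```

Note that every service asks the Identity service to validate the same token, which is why the availability of keystone affects all other services.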

5.5 Configuring the Identity Service

5.5.1 What is the Identity service?

The SUSE OpenStack Cloud Identity service, based on the OpenStack keystone API, provides UserID authentication and access authorization to help organizations achieve their access security and compliance objectives and successfully deploy OpenStack. In short, the Identity service is the gateway to the rest of the OpenStack services.

The identity service is installed automatically by the Cloud Lifecycle Manager (just after MySQL and RabbitMQ). When your cloud is up and running, you can customize keystone in a number of ways, including integrating with LDAP servers. This topic describes the default configuration. See Section 5.8, “Reconfiguring the Identity service” for changes you can implement. Also see Section 5.9, “Integrating LDAP with the Identity Service” for information on integrating with an LDAP provider.

5.5.2 Which version of the Identity service should you use?

Use Identity API version 3.0. Identity API v2.0 has been deprecated. Many features such as LDAP integration and fine-grained access control will not work with v2.0. The following are a few questions you may have regarding versions.

Why does the keystone identity catalog still show version 2.0?

Tempest tests still use the v2.0 API. They are in the process of migrating to v3.0. We will remove the v2.0 version once tempest has migrated the tests. The Identity catalog has version 2.0 just to support tempest migration.

Will the keystone identity v3.0 API work if the identity catalog has only the v2.0 endpoint?

Identity v3.0 does not rely on the content of the catalog. It will continue to work regardless of the version of the API in the catalog.

Which CLI client should you use?

You should use the OpenStack CLI. The keystone CLI is deprecated and does not support the v3.0 API; only the OpenStack CLI supports the v3.0 API.
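To make the v3.0 API more concrete, the following sketch builds the JSON body of a project-scoped password authentication request (POST /v3/auth/tokens). The helper name build_v3_password_auth and the sample values are hypothetical; the sketch only assembles the request body and does not contact any endpoint.

```python
import json


def build_v3_password_auth(username, password, user_domain, project, project_domain):
    """Build the JSON body for a project-scoped POST /v3/auth/tokens request."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"name": user_domain},
                        "password": password,
                    }
                },
            },
            "scope": {
                "project": {
                    "name": project,
                    "domain": {"name": project_domain},
                }
            },
        }
    }


# Placeholder values; substitute the credentials of your own deployment.
body = build_v3_password_auth("admin", "EXAMPLE_PASSWORD", "Default", "admin", "Default")
print(json.dumps(body, indent=2))
```

The domain entries for both the user and the project are what distinguish a v3.0 request; the v2.0 API has no notion of domains.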

5.5.3 Authentication

The authentication function provides the initial login function to OpenStack. keystone supports multiple sources of authentication, including a native or built-in authentication system. You can use the keystone native system for all user management functions for proof-of-concept deployments or small deployments not requiring integration with a corporate authentication system, but it lacks some of the advanced functions usually found in user management systems such as forcing password changes. The focus of the keystone native authentication system is to be the source of authentication for OpenStack-specific users required to operate various OpenStack services. These users are stored by keystone in a default domain; the addition of these IDs to an external authentication system is not required.

keystone is more commonly integrated with external authentication systems such as OpenLDAP or Microsoft Active Directory. These systems are usually centrally deployed by organizations to serve as the single source of user management and authentication for all in-house deployed applications and systems requiring user authentication. In addition to LDAP and Microsoft Active Directory, support for integration with Security Assertion Markup Language (SAML)-based identity providers from companies such as Ping, CA, IBM, Oracle, and others is also nearly "production-ready."

keystone also provides architectural support through the underlying Apache deployment for other types of authentication systems, such as multi-factor authentication. These types of systems typically require driver support and integration from the respective providers.

Note

While support for Identity providers and multi-factor authentication is available in keystone, it has not yet been certified by the SUSE OpenStack Cloud engineering team and is an experimental feature in SUSE OpenStack Cloud.

LDAP-compatible directories such as OpenLDAP and Microsoft Active Directory are recommended alternatives to using keystone local authentication. Both methods are widely used by organizations and are integrated with a variety of other enterprise applications. These directories act as the single source of user information within an organization. You can configure keystone to authenticate against an LDAP-compatible directory on a per-domain basis.

Domains, as explained in Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”, can be configured so that, based on the user ID, an incoming user is automatically mapped to a specific domain. You can then configure this domain to authenticate against a specific LDAP directory. User credentials provided by the user to keystone are passed along to the designated LDAP source for authentication. You can optionally configure this communication to be secure through SSL encryption. No special LDAP administrative access is required, and only read-only access is needed for this configuration. keystone will not add any LDAP information. All user additions, deletions, and modifications are performed by the application's front end in the LDAP directories. After a user has been successfully authenticated, that user is then assigned to the groups, roles, and projects defined by the keystone domain or project administrators. This information is stored in the keystone service database.

Another form of external authentication provided by the keystone service is through integration with SAML-based identity providers (IdP) such as Ping Identity, IBM Tivoli, and Microsoft Active Directory Federation Server. A SAML-based identity provider provides authentication that is often called "single sign-on." The IdP server is configured to authenticate against identity sources such as Active Directory and provides a single authentication API against multiple types of downstream identity sources. This means that an organization could have multiple identity storage sources but a single authentication source. In addition, if a user has logged into one such source during a defined session time frame, that user does not need to reauthenticate within the defined session. Instead, the IdP automatically validates the user to requesting applications and services.

A SAML-based IdP authentication source is configured with keystone on a per-domain basis similar to the manner in which native LDAP directories are configured. Extra mapping rules are required in the configuration that define which keystone group an incoming UID is automatically assigned to. This means that groups need to be defined in keystone first, but it also removes the requirement that a domain or project administrator assign user roles and project membership on a per-user basis. Instead, groups are used to define project membership and roles and incoming users are automatically mapped to keystone groups based on their upstream group membership. This strategy provides a consistent role-based access control (RBAC) model based on the upstream identity source. The configuration of this option is fairly straightforward. IdP vendors such as Ping and IBM are contributing to the maintenance of this function and have also produced their own integration documentation. HPE is using the Microsoft Active Directory Federation Services (AD FS) for functional testing and future documentation.

The third keystone-supported authentication source is known as multi-factor authentication (MFA). MFA typically requires an external source of authentication beyond a login name and password, and can include options such as SMS text, a temporal token generator, or a fingerprint scanner. Each of these MFA types is usually specific to a particular MFA vendor. The keystone architecture supports an MFA-based authentication system, but this has not yet been certified or documented for SUSE OpenStack Cloud.

5.5.4 Authorization

Another major function provided by the keystone service is access authorization that determines which resources and actions are available based on the UserID, the role of the user, and the projects that a user is provided access to. All of this information is created, managed, and stored by keystone. These functions are applied through the horizon web interface, the OpenStack command-line interface, or the direct keystone API.

keystone provides support for organizing users by using three entities:

Domains

Domains provide the highest level of organization. Domains are intended to be used as high-level containers for multiple projects. A domain can represent different tenants, companies, or organizations for an OpenStack cloud deployed for public cloud deployments or it can represent major business units, functions, or any other type of top-level organization unit in an OpenStack private cloud deployment. Each domain has at least one Domain Admin assigned to it. This Domain Admin can then create multiple projects within the domain and assign the project administrator role to specific project owners. Each domain created in an OpenStack deployment is unique and the projects assigned to a domain cannot exist in another domain.

Projects

Projects are entities within a domain that represent groups of users, each user role within that project, and how many underlying infrastructure resources can be consumed by members of the project.

Groups

Groups are an optional function and provide the means of assigning project roles to multiple users at once.

keystone also makes it possible to create and assign roles to groups of users or individual users. Role names are created and user assignments are made within keystone. The actual function of a role is defined currently for each OpenStack service via scripts. When users request access to an OpenStack service, their access tokens contain information about their assigned project membership and role for that project. This role is then matched to the service-specific script and users are allowed to perform functions within that service defined by the role mapping.

5.5.5 Default settings

Identity service configuration settings

The identity service configuration options are described in the OpenStack documentation at keystone Configuration Options on the OpenStack site.

Default domain and service accounts

The "default" domain is automatically created during the installation to contain the various required OpenStack service accounts, including the following:

admin

heat

monasca-agent

barbican

logging

neutron

barbican_service

logging_api

nova

ceilometer

logging_beaver

nova_monasca

cinder

logging_monitor

octavia

cinderinternal

magnum

placement

demo

manila

swift

designate

manilainternal

swift-demo

glance

monasca

swift-dispersion

glance-check

monasca_read_only

swift-monitor

glance-swift

These are required accounts and are used by the underlying OpenStack services. These accounts should not be removed or reassigned to a different domain. The "default" domain should be used only for these service accounts.

5.5.6 Preinstalled roles

The following are the preinstalled roles. Users with the "admin" role can create additional roles. Roles are defined on a per-service basis (more information is available at Manage projects, users, and roles on the OpenStack website).

Role    Description
admin

The "superuser" role. Provides full access to all SUSE OpenStack Cloud services across all domains and projects. This role should be given only to a cloud administrator.

member

A general role that enables a user to access resources within an assigned project including creating, modifying, and deleting compute, storage, and network resources.

You can find additional information on these roles in each service policy stored in the /etc/PROJECT/policy.json files where PROJECT is a placeholder for an OpenStack service. For example, the Compute (nova) service roles are stored in the /etc/nova/policy.json file. Each service policy file defines the specific API functions available to a role label.

5.6 Retrieving the Admin Password

The admin password will be used to access the dashboard and Operations Console as well as allow you to authenticate to use the command-line tools and API.

A default SUSE OpenStack Cloud 9 installation creates a randomly generated password for the Admin user. The following steps show you how to retrieve this password.

5.6.1 Retrieving the Admin Password

You can retrieve the randomly generated Admin password by using this command on the Cloud Lifecycle Manager:

ardana > cat ~/service.osrc

In this example output, the value for OS_PASSWORD is the Admin password:

ardana > cat ~/service.osrc
unset OS_DOMAIN_NAME
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_VERSION=3
export OS_PROJECT_NAME=admin
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USERNAME=admin
export OS_USER_DOMAIN_NAME=Default
export OS_PASSWORD=SlWSfwxuJY0
export OS_AUTH_URL=https://10.13.111.145:5000/v3
export OS_ENDPOINT_TYPE=internalURL
# OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
export OS_INTERFACE=internal
export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
export OS_COMPUTE_API_VERSION=2
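If you need the values from this file in a script, the export lines can be parsed directly. The following is a minimal sketch; the parse_osrc helper is hypothetical, and the sample text is a shortened copy of the example output above.

```python
import shlex


def parse_osrc(text):
    """Collect NAME=value pairs from 'export NAME=value' lines; ignore the rest."""
    env = {}
    for line in text.splitlines():
        parts = shlex.split(line, comments=True)  # drops '#' comment lines
        if len(parts) == 2 and parts[0] == "export" and "=" in parts[1]:
            name, value = parts[1].split("=", 1)
            env[name] = value
    return env


# Shortened copy of the example output shown above.
SAMPLE = """\
unset OS_DOMAIN_NAME
export OS_USERNAME=admin
export OS_PASSWORD=SlWSfwxuJY0
# OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
export OS_INTERFACE=internal
"""

env = parse_osrc(SAMPLE)
print(env["OS_USERNAME"])
```

Treat the resulting dictionary as sensitive, because it contains the Admin password under the OS_PASSWORD key.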

5.7 Changing Service Passwords

SUSE OpenStack Cloud provides a process for changing the default service passwords, including your admin user password, which you may want to do for security or other purposes.

You can easily change the inter-service passwords used for authenticating communications between services in your SUSE OpenStack Cloud deployment, promoting better compliance with your organization’s security policies. The inter-service passwords that can be changed include (but are not limited to) keystone, MariaDB, RabbitMQ, Cloud Lifecycle Manager cluster, monasca and barbican.

The general process for changing the passwords is to:

  • Indicate to the configuration processor which password(s) you want to change, and optionally include the value of that password

  • Run the configuration processor to generate the new passwords (you do not need to run git add before this)

  • Run ready-deployment

  • Check your password name(s) against the tables included below to see which high-level credentials-change playbook(s) you need to run

  • Run the appropriate high-level credentials-change playbook(s)

5.7.1 Password Strength

Encryption passwords supplied to the configuration processor for use with Ansible Vault and for encrypting the configuration processor’s persistent state must have a minimum length of 12 characters and a maximum of 128 characters. Passwords must contain characters from each of the following three categories:

  • Uppercase characters (A-Z)

  • Lowercase characters (a-z)

  • Base 10 digits (0-9)

Service Passwords that are automatically generated by the configuration processor are chosen from the 62 characters made up of the 26 uppercase, the 26 lowercase, and the 10 numeric characters, with no preference given to any character or set of characters, with the minimum and maximum lengths being determined by the specific requirements of individual services.

Important

Currently, you cannot use any special characters with Ansible Vault, Service Passwords, or vCenter configuration.
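The rules in this section can be expressed as a short check. This is a sketch under stated assumptions: the function names are hypothetical, the default length of 16 is arbitrary, and the generator only illustrates uniform selection from the 62-character set; it is not the configuration processor's implementation.

```python
import secrets
import string

# 26 uppercase + 26 lowercase + 10 digits = 62 characters
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def is_valid_encryption_password(password):
    """Check length (12-128) and presence of uppercase, lowercase, and digits."""
    if not 12 <= len(password) <= 128:
        return False
    if any(ch not in ALPHABET for ch in password):
        return False  # special characters are not supported
    return (
        any(ch in string.ascii_uppercase for ch in password)
        and any(ch in string.ascii_lowercase for ch in password)
        and any(ch in string.digits for ch in password)
    )


def generate_service_password(length=16):
    """Pick each character uniformly from the 62-character alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(is_valid_encryption_password("Abcdefgh1234"))  # True
print(is_valid_encryption_password("abcdefgh1234"))  # False: no uppercase character
print(len(generate_service_password()))              # 16
```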

5.7.2 Telling the configuration processor which password(s) you want to change

In SUSE OpenStack Cloud 9, the configuration processor will produce metadata about each of the passwords (and other variables) that it generates in the file ~/openstack/my_cloud/info/private_data_metadata_ccp.yml. A snippet of this file follows:

5.7.3 private_data_metadata_ccp.yml

metadata_proxy_shared_secret:
  metadata:
  - clusters:
    - cluster1
    component: nova-metadata
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
mysql_admin_password:
  metadata:
  - clusters:
    - cluster1
    component: ceilometer
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: heat
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: keystone
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    - compute
    component: nova
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: cinder
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: glance
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    - compute
    component: neutron
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: horizon
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
mysql_barbican_password:
  metadata:
  - clusters:
    - cluster1
    component: barbican
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  version: '2.0'

For each variable, there is a metadata entry for each pair of services that use the variable including a list of the clusters on which the service component that consumes the variable (defined as "component:" in private_data_metadata_ccp.yml above) runs.

Note above that the variable mysql_admin_password is used by a number of service components, and the service that is consumed in each case is mysql, which in this context refers to the MariaDB instance that is part of the product.
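To see how the metadata entries relate variables to components, the following sketch lists the consumers of a variable. The METADATA dictionary is a hand-copied subset of the snippet above (an assumption made to keep the example self-contained; a real script would load the YAML file), and the consumers helper is hypothetical.

```python
# Hand-copied subset of the snippet above; a real script would load the YAML file.
METADATA = {
    "mysql_admin_password": {
        "metadata": [
            {"clusters": ["cluster1"], "component": "ceilometer", "consumes": "mysql"},
            {"clusters": ["cluster1", "compute"], "component": "nova", "consumes": "mysql"},
        ],
        "version": "2.0",
    },
    "mysql_barbican_password": {
        "metadata": [
            {"clusters": ["cluster1"], "component": "barbican", "consumes": "mysql"},
        ],
        "version": "2.0",
    },
}


def consumers(metadata, variable):
    """Return (component, consumed service, clusters) for each entry of a variable."""
    return [
        (entry.get("component"), entry.get("consumes"), entry["clusters"])
        for entry in metadata[variable]["metadata"]
    ]


for component, service, clusters in consumers(METADATA, "mysql_admin_password"):
    print(f"{component} -> {service} on {', '.join(clusters)}")
```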

5.7.4 Steps to change a password

First, make sure that you have a copy of private_data_metadata_ccp.yml. If you do not, generate one by running the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Make a copy of the private_data_metadata_ccp.yml file and place it into the ~/openstack/change_credentials directory:

ardana > cp ~/openstack/my_cloud/info/private_data_metadata_ccp.yml \
 ~/openstack/change_credentials/

Edit the copied file in ~/openstack/change_credentials so that it contains only the passwords you intend to change. Delete all other entries from this template file.

Important

If you leave other passwords in that file that you do not want to change, they will be regenerated and will no longer match those in use, which could disrupt operations.

Note

You must change passwords in batches, one batch for each category listed below.

For example, the snippet below would result in the configuration processor generating new random values for keystone_backup_password, keystone_ceilometer_password, and keystone_cinder_password:

keystone_backup_password:
  metadata:
  - clusters:
    - cluster0
    - cluster1
    - compute
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
keystone_ceilometer_password:
  metadata:
  - clusters:
    - cluster1
    component: ceilometer-common
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
keystone_cinder_password:
  metadata:
  - clusters:
    - cluster1
    component: cinder-api
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
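Trimming the copied file by hand is error-prone, so the selection can also be scripted. The following sketch works on already parsed data; the keep_only helper and the abbreviated sample dictionary are hypothetical, and loading or writing the YAML file is left out.

```python
def keep_only(metadata, names):
    """Return a copy containing only the named entries; fail on unknown names."""
    wanted = set(names)
    missing = sorted(wanted - set(metadata))
    if missing:
        raise KeyError(f"not present in metadata: {', '.join(missing)}")
    return {name: entry for name, entry in metadata.items() if name in wanted}


# Abbreviated stand-in for the parsed metadata file (entries shortened to 'version').
full = {
    "keystone_backup_password": {"version": "2.0"},
    "keystone_ceilometer_password": {"version": "2.0"},
    "keystone_cinder_password": {"version": "2.0"},
    "mysql_admin_password": {"version": "2.0"},
}

trimmed = keep_only(full, ["keystone_backup_password", "keystone_cinder_password"])
print(sorted(trimmed))
```

Failing on unknown names is a deliberate choice in this sketch: a misspelled password name is reported immediately instead of silently producing an empty change file.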

5.7.5 Specifying password value

Optionally, you can specify a value for the password by including a "value:" key and value at the same level as metadata:

keystone_backup_password:
  value: 'new_password'
  metadata:
  - clusters:
    - cluster0
    - cluster1
    - compute
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'

Note that you can have multiple files in ~/openstack/change_credentials. The configuration processor will only read files that end in .yml or .yaml.

Note

If you have specified a password value in your credential change file, you may want to encrypt it using ansible-vault. If you decide to encrypt with ansible-vault, make sure that you use the encryption key you have already used when running the configuration processor.

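To encrypt a file using ansible-vault, execute the following commands, replacing CREDENTIAL_CHANGE_FILE with the name of your credential change file (ending in .yml or .yaml):

ardana > cd ~/openstack/change_credentials
ardana > ansible-vault encrypt CREDENTIAL_CHANGE_FILE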

Be sure to provide the encryption key when prompted. If you specify the wrong ansible-vault password, the configuration processor fails with a message like the following:

################################################## Reading Persistent State ##################################################

################################################################################
# The configuration processor failed.
# PersistentStateCreds: User-supplied creds file test1.yml was not parsed properly
################################################################################

5.7.6 Running the configuration processor to change passwords

The directory ~/openstack/change_credentials is not managed by git, so to rerun the configuration processor to generate new passwords and prepare for the next deployment, enter the following commands:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Note

The files that you placed in ~/openstack/change_credentials should be removed once you have run the configuration processor because the old password values and new password values will be stored in the configuration processor's persistent state.

If you see output like the following after running the configuration processor, the password name you supplied ('blah' in this example) does not exist:

################################################################################
# The configuration processor completed with warnings.
# PersistentStateCreds: User-supplied password name 'blah' is not valid
################################################################################

A failure to correctly parse the credentials change file causes the configuration processor to fail with a message like the following:

################################################## Reading Persistent State ##################################################

################################################################################
# The configuration processor failed.
# PersistentStateCreds: User-supplied creds file test1.yml was not parsed properly
################################################################################

Once you have run the configuration processor to change passwords, an information file ~/openstack/my_cloud/info/password_change.yml similar to the private_data_metadata_ccp.yml is written to tell you which passwords have been changed, including metadata but not including the values.

5.7.7 Password change playbooks and tables

Once you have completed the steps above to change password values and prepare for the deployment that will actually switch over to the new passwords, you will need to run some high-level playbooks. The passwords that can be changed are grouped into six categories. The tables below list the password names that belong in each category. The categories are:

keystone

Playbook: ardana-keystone-credentials-change.yml

RabbitMQ

Playbook: ardana-rabbitmq-credentials-change.yml

MariaDB

Playbook: ardana-reconfigure.yml

Cluster

Playbook: ardana-cluster-credentials-change.yml

monasca

Playbook: monasca-reconfigure-credentials-change.yml

Other

Playbook: ardana-other-credentials-change.yml

It is recommended that you change passwords in batches; in other words, run through a complete password change process for each batch of passwords, preferably in the above order. Once you have followed the process indicated above to change password(s), check the names against the tables below to see which password change playbook(s) you should run.

Changing identity service credentials

The following table lists identity service credentials you can change.

keystone credentials
Password name
barbican_admin_password
barbican_service_password
keystone_admin_pwd
keystone_ceilometer_password
keystone_cinder_password
keystone_cinderinternal_password
keystone_demo_pwd
keystone_designate_password
keystone_glance_password
keystone_glance_swift_password
keystone_heat_password
keystone_magnum_password
keystone_monasca_agent_password
keystone_monasca_password
keystone_neutron_password
keystone_nova_password
keystone_octavia_password
keystone_swift_dispersion_password
keystone_swift_monitor_password
keystone_swift_password
nova_monasca_password

The playbook to run to change keystone credentials is ardana-keystone-credentials-change.yml. Execute the following commands to make the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-keystone-credentials-change.yml

Changing RabbitMQ credentials

The following table lists the RabbitMQ credentials you can change.

RabbitMQ credentials
Password name
rmq_barbican_password
rmq_ceilometer_password
rmq_cinder_password
rmq_designate_password
rmq_keystone_password
rmq_magnum_password
rmq_monasca_monitor_password
rmq_nova_password
rmq_octavia_password
rmq_service_password

The playbook to run to change RabbitMQ credentials is ardana-rabbitmq-credentials-change.yml. Execute the following commands to make the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-rabbitmq-credentials-change.yml

Changing MariaDB credentials

The following table lists the MariaDB credentials you can change.

MariaDB credentials
Password name
mysql_admin_password
mysql_barbican_password
mysql_clustercheck_pwd
mysql_designate_password
mysql_magnum_password
mysql_monasca_api_password
mysql_monasca_notifier_password
mysql_monasca_thresh_password
mysql_octavia_password
mysql_powerdns_password
mysql_root_pwd
mysql_sst_password
ops_mon_mdb_password
mysql_monasca_transform_password
mysql_nova_api_password
password

The playbook to run to change MariaDB credentials is ardana-reconfigure.yml. To make the changes, execute the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml

Changing cluster credentials

The following table lists the cluster credentials you can change.

cluster credentials
Password name
haproxy_stats_password
keepalive_vrrp_password

The playbook to run to change cluster credentials is ardana-cluster-credentials-change.yml. To make changes, execute the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-cluster-credentials-change.yml

Changing monasca credentials

The following table lists the monasca credentials you can change.

monasca credentials
Password name
cassandra_monasca_api_password
cassandra_monasca_persister_password

The playbook to run to change monasca credentials is monasca-reconfigure-credentials-change.yml. To make the changes, execute the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure-credentials-change.yml

Changing other credentials

The following table lists the other credentials you can change.

Other credentials
Password name
logging_beaver_password
logging_api_password
logging_monitor_password
logging_kibana_password

The playbook to run to change these credentials is ardana-other-credentials-change.yml. To make the changes, execute the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-other-credentials-change.yml
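Looking up each password name in the tables can be automated. The following sketch copies the password names from the tables above and returns the playbooks to run in the recommended order; the PLAYBOOKS structure and the playbooks_for helper are hypothetical and are not part of the product.

```python
# Password names copied from the tables above, one entry per playbook,
# listed in the order in which the categories are introduced.
PLAYBOOKS = [
    ("ardana-keystone-credentials-change.yml", """
        barbican_admin_password barbican_service_password keystone_admin_pwd
        keystone_ceilometer_password keystone_cinder_password
        keystone_cinderinternal_password keystone_demo_pwd keystone_designate_password
        keystone_glance_password keystone_glance_swift_password keystone_heat_password
        keystone_magnum_password keystone_monasca_agent_password keystone_monasca_password
        keystone_neutron_password keystone_nova_password keystone_octavia_password
        keystone_swift_dispersion_password keystone_swift_monitor_password
        keystone_swift_password nova_monasca_password"""),
    ("ardana-rabbitmq-credentials-change.yml", """
        rmq_barbican_password rmq_ceilometer_password rmq_cinder_password
        rmq_designate_password rmq_keystone_password rmq_magnum_password
        rmq_monasca_monitor_password rmq_nova_password rmq_octavia_password
        rmq_service_password"""),
    ("ardana-reconfigure.yml", """
        mysql_admin_password mysql_barbican_password mysql_clustercheck_pwd
        mysql_designate_password mysql_magnum_password mysql_monasca_api_password
        mysql_monasca_notifier_password mysql_monasca_thresh_password
        mysql_octavia_password mysql_powerdns_password mysql_root_pwd mysql_sst_password
        ops_mon_mdb_password mysql_monasca_transform_password mysql_nova_api_password
        password"""),
    ("ardana-cluster-credentials-change.yml",
     "haproxy_stats_password keepalive_vrrp_password"),
    ("monasca-reconfigure-credentials-change.yml",
     "cassandra_monasca_api_password cassandra_monasca_persister_password"),
    ("ardana-other-credentials-change.yml", """
        logging_beaver_password logging_api_password logging_monitor_password
        logging_kibana_password"""),
]


def playbooks_for(names):
    """Return (playbooks to run in table order, names not found in any table)."""
    needed, unknown = [], []
    for name in names:
        for playbook, table in PLAYBOOKS:
            if name in table.split():
                if playbook not in needed:
                    needed.append(playbook)
                break
        else:
            unknown.append(name)
    order = [playbook for playbook, _ in PLAYBOOKS]
    return sorted(needed, key=order.index), unknown


print(playbooks_for(["mysql_root_pwd", "nova_monasca_password", "blah"]))
```

An explicit lookup is used instead of matching on name prefixes because some names do not follow their category prefix, for example nova_monasca_password belongs to the keystone table.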

5.7.8 Changing RADOS Gateway Credential

To change the keystone credentials of RADOS Gateway, follow the steps documented in Section 5.7, “Changing Service Passwords”. Modify the keystone_rgw_password section of the private_data_metadata_ccp.yml file as described in Section 5.7.4, “Steps to change a password” or Section 5.7.5, “Specifying password value”.

5.7.9 Immutable variables

The values of certain variables are immutable, which means that once they have been generated by the configuration processor they cannot be changed. These variables are:

  • barbican_master_kek_db_plugin

  • swift_hash_path_suffix

  • swift_hash_path_prefix

  • mysql_cluster_name

  • heartbeat_key

  • erlang_cookie

The configuration processor will not regenerate the values of the above variables, nor will it allow you to specify a value for them. In addition to the above variables, the following are immutable in SUSE OpenStack Cloud 9:

  • All ssh keys generated by the configuration processor

  • All UUIDs generated by the configuration processor

  • metadata_proxy_shared_secret

  • horizon_secret_key

  • ceilometer_metering_secret
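Before placing a credential change file in ~/openstack/change_credentials, it can be useful to check that it does not name an immutable variable. The following sketch covers only the variables listed by name above (SSH keys and UUIDs are not identified by name here); the IMMUTABLE set and the immutable_entries helper are hypothetical.

```python
# Named immutable variables from the two lists above.
# SSH keys and UUIDs are also immutable but are not covered by this name check.
IMMUTABLE = {
    "barbican_master_kek_db_plugin",
    "swift_hash_path_suffix",
    "swift_hash_path_prefix",
    "mysql_cluster_name",
    "heartbeat_key",
    "erlang_cookie",
    "metadata_proxy_shared_secret",
    "horizon_secret_key",
    "ceilometer_metering_secret",
}


def immutable_entries(requested_names):
    """Return the requested names that cannot be changed, sorted alphabetically."""
    return sorted(set(requested_names) & IMMUTABLE)


print(immutable_entries(["keystone_cinder_password", "erlang_cookie"]))  # ['erlang_cookie']
```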

5.8 Reconfiguring the Identity service

5.8.1 Updating the keystone Identity Service

This topic explains configuration options for the Identity service.

SUSE OpenStack Cloud lets you perform updates on the following parts of the Identity service configuration:

5.8.2 Updating the Main Identity service Configuration File

  1. The main keystone Identity service configuration file (/etc/keystone/keystone.conf), located on each control plane server, is generated from the following template file located on a Cloud Lifecycle Manager: ~/openstack/my_cloud/config/keystone/keystone.conf.j2

    Modify this template file as appropriate. See keystone Liberty documentation for full descriptions of all settings. This is a Jinja2 template, which expects certain template variables to be set. Do not change values inside double curly braces: {{ }}.

    Note

    SUSE OpenStack Cloud 9 has the following token expiration setting, which differs from the upstream value 3600:

    [token]
    expiration = 14400
  2. After you modify the template, commit the change to the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”):

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add my_cloud/config/keystone/keystone.conf.j2
    ardana > git commit -m "Adjusting some parameters in keystone.conf"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Run the reconfiguration playbook in the deployment area:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
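
To put the expiration value from step 1 in perspective, 14400 seconds is four hours. The sketch below is not part of SUSE OpenStack Cloud; it reads the setting from a rendered configuration fragment with Python's standard configparser module. It assumes rendered text such as the content of /etc/keystone/keystone.conf, because the Jinja2 template itself may not parse as INI. The function name token_lifetime_seconds is hypothetical.

```python
import configparser

# A rendered fragment of keystone.conf, limited to the setting quoted in step 1.
RENDERED_FRAGMENT = """
[token]
expiration = 14400
"""

UPSTREAM_DEFAULT_SECONDS = 3600  # upstream value quoted in the note in step 1


def token_lifetime_seconds(conf_text):
    """Return [token]/expiration in seconds, or the upstream default if unset."""
    # Interpolation is disabled so that literal '%' characters in a real
    # configuration file are not treated as configparser placeholders.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.getint("token", "expiration", fallback=UPSTREAM_DEFAULT_SECONDS)


seconds = token_lifetime_seconds(RENDERED_FRAGMENT)
print(f"{seconds} s = {seconds / 3600:g} h")
```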

5.8.3 Enabling Identity Service Features

To enable or disable keystone features, do the following:

  1. Adjust the respective parameters in ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml

  2. Commit the change to the local git repository, and rerun the configuration processor/deployment area preparation playbooks (as suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”):

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add my_cloud/config/keystone/keystone_deploy_config.yml
    ardana > git commit -m "Adjusting some WSGI or logging parameters for keystone"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Run the reconfiguration playbook in the deployment area:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

5.8.4 Fernet Tokens

SUSE OpenStack Cloud 9 supports Fernet tokens by default. The benefit of using Fernet tokens is that tokens are not persisted in a database, which is helpful if you want to deploy the keystone Identity service as one master and multiple slaves; only roles, projects, and other details are replicated from master to slaves. The token table is not replicated.

Note

Tempest does not work with Fernet tokens in SUSE OpenStack Cloud 9. If Fernet tokens are enabled, do not run token tests in Tempest.

Note

During reconfiguration when switching to a Fernet token provider or during Fernet key rotation, you may see a warning in keystone.log stating [fernet_tokens] key_repository is world readable: /etc/keystone/fernet-keys/. This is expected. You can safely ignore this message. For other keystone operations, this warning is not displayed. Directory permissions are set to 600 (read/write by owner only), not world readable.
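
If you want to confirm the permissions on the key repository yourself, a small check like the following can help. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the demonstration runs against a temporary directory rather than /etc/keystone/fernet-keys/.

```python
import os
import stat
import tempfile


def is_world_readable(path):
    """Return True if the read bit for 'other' users is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)


def permission_string(path):
    """Return the permission bits of path as an octal string such as '700'."""
    return format(stat.S_IMODE(os.stat(path).st_mode), "03o")


if __name__ == "__main__":
    # Demonstrate on a throwaway directory, not on the real key repository.
    with tempfile.TemporaryDirectory() as demo_dir:
        os.chmod(demo_dir, 0o700)
        print(permission_string(demo_dir), is_world_readable(demo_dir))
```

To check the real repository, pass its path to is_world_readable as a user who is allowed to stat that directory.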

Fernet token-signing key rotation is handled by a cron job, which is configured on one of the controllers. The controller with the Fernet token-signing key rotation cron job is also known as the Fernet Master node. By default, the Fernet token-signing key is rotated once every 24 hours. The Fernet token-signing keys are distributed from the Fernet Master node to the rest of the controllers at each rotation. Therefore, the Fernet token-signing keys are consistent across all controllers at all times.

When enabling the Fernet token provider for the first time, specific steps are needed to set up the necessary mechanisms for Fernet token-signing key distribution.

  1. Set keystone_configure_fernet to True in ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml.

  2. Run the following commands to commit your change in Git and enable Fernet:

    ardana > cd ~/openstack
    ardana > git add my_cloud/config/keystone/keystone_deploy_config.yml
    ardana > git commit -m "enable Fernet token provider"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-deploy.yml

When the Fernet token provider is enabled, a Fernet Master alarm definition is also created in monasca to monitor the Fernet Master node. If the Fernet Master node is offline or unreachable, a CRITICAL alarm is raised so that the Cloud Admin can take corrective action. If the Fernet Master node is offline for a prolonged period of time, Fernet token-signing key rotation is not performed. This may introduce security risks to the cloud. The Cloud Admin must take immediate action to restore the Fernet Master node.
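
The risk described above comes down to a simple time comparison: with a 24-hour schedule, a key that has not been rotated for noticeably longer than 24 hours indicates a missed rotation. The following Python sketch illustrates that comparison only. It is not how the monasca alarm is implemented, the function name rotation_overdue and the one-hour grace period are assumptions, and obtaining the time of the last rotation is left to the caller.

```python
from datetime import datetime, timedelta, timezone

ROTATION_INTERVAL = timedelta(hours=24)  # default rotation interval stated above


def rotation_overdue(last_rotation, now, grace=timedelta(hours=1)):
    """Return True if more than one interval plus a grace period has elapsed.

    last_rotation and now are timezone-aware datetimes. The grace period is an
    assumed allowance for cron jitter and key distribution time.
    """
    return now - last_rotation > ROTATION_INTERVAL + grace


if __name__ == "__main__":
    last = datetime(2022, 11, 1, 3, 0, tzinfo=timezone.utc)
    print(rotation_overdue(last, last + timedelta(hours=30)))
```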

5.9 Integrating LDAP with the Identity Service

5.9.1 Integrating with an external LDAP server

The keystone identity service provides two primary functions: user authentication and access authorization. The user authentication function validates a user's identity. keystone has a very basic user management system that can be used to create and manage user login and password credentials, but this system is intended only for proof-of-concept deployments because its password control functions are very limited. The internal identity service user management system is also commonly used to store and authenticate OpenStack-specific service account information.

The recommended source of authentication is an external user management system such as an LDAP directory service. The identity service can be configured to connect to and use external systems as the source of user authentication. The identity service domain construct is used to define different authentication sources based on domain membership. For example, a cloud deployment could consist of as few as two domains:

  • The default domain that is pre-configured for the service account users that are authenticated directly against the identity service internal user management system

  • A customer-defined domain that contains all user projects and membership definitions. This domain can then be configured to use an external LDAP directory such as Microsoft Active Directory as the authentication source.

SUSE OpenStack Cloud can support multiple domains for deployments that support multiple tenants. Multiple domains can be created with each domain configured to either the same or different external authentication sources. This deployment model is known as a "per-domain" model.

There are currently two ways to configure "per-domain" authentication sources:

  • File store – each domain configuration is created and stored in separate text files. This is the older and current default method for defining domain configurations.

  • Database store – each domain configuration can be created using either the identity service manager utility (recommended) or a Domain Admin API (from OpenStack.org), and the results are stored in the identity service MariaDB database. This database store is a new method introduced in the OpenStack Kilo release and now available in SUSE OpenStack Cloud.

Instructions for initially creating per-domain configuration files and then migrating to the Database store method via the identity service manager utility are provided as follows.

5.9.2 Set up domain-specific driver configuration - file store

To update the configuration for a specific LDAP domain:

  1. Ensure that the following configuration options are in the main configuration file template: ~/openstack/my_cloud/config/keystone/keystone.conf.j2

    [identity]
    domain_specific_drivers_enabled = True
    domain_configurations_from_database = False
  2. Create a YAML file that contains the definition of the LDAP server connection. The sample file below is already provided as part of the Cloud Lifecycle Manager in the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”. It is available on the Cloud Lifecycle Manager in the following file:

    ~/openstack/my_cloud/config/keystone/keystone_configure_ldap_sample.yml

    Save a copy of this file with a new name, for example:

    ~/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml
    Note

    Please refer to the LDAP section of the keystone configuration example for OpenStack for the full option list and description.

    Below are samples of YAML configurations for identity service LDAP certificate settings, optimized for Microsoft Active Directory server.

    Sample YAML configuration keystone_configure_ldap_my.yml

    ---
    keystone_domainldap_conf:
    
        # CA certificates file content.
        # Certificates are stored in Base64 PEM format. This may be entire LDAP server
        # certificate (in case of self-signed certificates), certificate of authority
        # which issued LDAP server certificate, or a full certificate chain (Root CA
        # certificate, intermediate CA certificate(s), issuer certificate).
        #
        cert_settings:
          cacert: |
            -----BEGIN CERTIFICATE-----
    
            certificate appears here
    
            -----END CERTIFICATE-----
    
        # A domain will be created in MariaDB with this name, and associated with ldap back end.
        # Installer will also generate a config file named /etc/keystone/domains/keystone.<domain_name>.conf
        #
        domain_settings:
          name: ad
          description: Dedicated domain for ad users
    
        conf_settings:
          identity:
             driver: ldap
    
    
          # For a full list and description of ldap configuration options, please refer to
          # https://github.com/openstack/keystone/blob/master/etc/keystone.conf.sample or
          # http://docs.openstack.org/liberty/config-reference/content/keystone-configuration-file.html.
          #
          # Please note:
          #  1. LDAP configuration is read-only. Configuration which performs write operations (i.e. creates users, groups, etc)
          #     is not supported at the moment.
          #  2. LDAP is only supported for identity operations (reading users and groups from LDAP). Assignment
          #     operations with LDAP (i.e. managing roles, projects) are not supported.
          #  3. LDAP is configured as non-default domain. Configuring LDAP as a default domain is not supported.
          #
          ldap:
            url: ldap://ad.hpe.net
            suffix: DC=hpe,DC=net
            query_scope: sub
            user_tree_dn: CN=Users,DC=hpe,DC=net
            user: CN=admin,CN=Users,DC=hpe,DC=net
            password: REDACTED
            user_objectclass: user
            user_id_attribute: cn
            user_name_attribute: cn
            group_tree_dn: CN=Users,DC=hpe,DC=net
            group_objectclass: group
            group_id_attribute: cn
            group_name_attribute: cn
            use_pool: True
            user_enabled_attribute: userAccountControl
            user_enabled_mask: 2
            user_enabled_default: 512
            use_tls: True
            tls_req_cert: demand
            # if you are configuring multiple LDAP domains, and LDAP server certificates are issued
            # by different authorities, make sure that you place certs for all the LDAP backend domains in the
            # cacert parameter as seen in this sample yml file so that all the certs are combined in a single CA file
            # and every LDAP domain configuration points to the combined CA file.
            # Note:
            # 1. Please be advised that every time a new ldap domain is configured, the single CA file gets overwritten
            # and hence ensure that you place certs for all the LDAP backend domains in the cacert parameter.
            # 2. There is a known issue on one cert per CA file per domain when the system processes
            # concurrent requests to multiple LDAP domains. Using the single CA file with all certs combined
            # shall get the system working properly*.
    
            tls_cacertfile: /etc/keystone/ssl/certs/all_ldapdomains_ca.pem
    
            # The issue is in the underlying SSL library. Upstream is not investing in python-ldap package anymore.
            # It is also not python3 compliant.
    keystone_domain_MSAD_conf:
    
        # CA certificates file content.
        # Certificates are stored in Base64 PEM format. This may be entire LDAP server
        # certificate (in case of self-signed certificates), certificate of authority
        # which issued LDAP server certificate, or a full certificate chain (Root CA
        # certificate, intermediate CA certificate(s), issuer certificate).
        #
        cert_settings:
          cacert: |
            -----BEGIN CERTIFICATE-----
    
            certificate appears here
    
            -----END CERTIFICATE-----
    
        # A domain will be created in MariaDB with this name, and associated with ldap back end.
        # Installer will also generate a config file named /etc/keystone/domains/keystone.<domain_name>.conf
        #
        domain_settings:
          name: msad
          description: Dedicated domain for msad users
    
        conf_settings:
          identity:
            driver: ldap
    
          # For a full list and description of ldap configuration options, please refer to
          # https://github.com/openstack/keystone/blob/master/etc/keystone.conf.sample or
          # http://docs.openstack.org/liberty/config-reference/content/keystone-configuration-file.html.
          #
          # Please note:
          #  1. LDAP configuration is read-only. Configuration which performs write operations (i.e. creates users, groups, etc)
          #     is not supported at the moment.
          #  2. LDAP is only supported for identity operations (reading users and groups from LDAP). Assignment
          #     operations with LDAP (i.e. managing roles, projects) are not supported.
          #  3. LDAP is configured as non-default domain. Configuring LDAP as a default domain is not supported.
          #
          ldap:
            # If the url parameter is set to ldap then typically use_tls should be set to True. If
            # url is set to ldaps, then use_tls should be set to False
            url: ldaps://10.16.22.5
            use_tls: False
            query_scope: sub
            user_tree_dn: DC=l3,DC=local
            # this is the user and password for the account that has access to the AD server
            user: administrator@l3.local
            password: OpenStack123
            user_objectclass: user
            # For a default Active Directory schema this is where to find the user name, openldap uses a different value
            user_id_attribute: userPrincipalName
            user_name_attribute: sAMAccountName
            group_tree_dn: DC=l3,DC=local
            group_objectclass: group
            group_id_attribute: cn
            group_name_attribute: cn
            # An upstream defect requires use_pool to be set false
            use_pool: False
            user_enabled_attribute: userAccountControl
            user_enabled_mask: 2
            user_enabled_default: 512
            tls_req_cert: allow
            # Referrals may contain urls that can't be resolved and will cause timeouts, ignore them
            chase_referrals: False
            # if you are configuring multiple LDAP domains, and LDAP server certificates are issued
            # by different authorities, make sure that you place certs for all the LDAP backend domains in the
            # cacert parameter as seen in this sample yml file so that all the certs are combined in a single CA file
            # and every LDAP domain configuration points to the combined CA file.
            # Note:
            # 1. Please be advised that every time a new ldap domain is configured, the single CA file gets overwritten
            # and hence ensure that you place certs for all the LDAP backend domains in the cacert parameter.
            # 2. There is a known issue on one cert per CA file per domain when the system processes
            # concurrent requests to multiple LDAP domains. Using the single CA file with all certs combined
            # shall get the system working properly.
    
            tls_cacertfile: /etc/keystone/ssl/certs/all_ldapdomains_ca.pem
  3. As suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”, commit the new file to the local git repository, and rerun the configuration processor and ready deployment playbooks:

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add my_cloud/config/keystone/keystone_configure_ldap_my.yml
    ardana > git commit -m "Adding LDAP server integration config"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run the reconfiguration playbook in a deployment area, passing the YAML file created in step 2 as a command-line option:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@~/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml
  5. Follow these same steps for each LDAP domain with which you are integrating the identity service, creating a YAML file for each and running the reconfigure playbook once for each additional domain.

  6. Verify that a new domain was created for LDAP (Microsoft AD in this example). First, set the environment variables for admin-level access:

    ardana > source keystone.osrc

    Get a list of domains:

    ardana > openstack domain list

    The output resembles the following:

    +----------------------------------+---------+---------+----------------------------------------------------------------------+
    | ID                               | Name    | Enabled | Description                                                          |
    +----------------------------------+---------+---------+----------------------------------------------------------------------+
    | 6740dbf7465a4108a36d6476fc967dbd | heat    | True    | Owns users and projects created by heat                              |
    | default                          | Default | True    | Owns users and tenants (i.e. projects) available on Identity API v2. |
    | b2aac984a52e49259a2bbf74b7c4108b | ad      | True    | Dedicated domain for users managed by Microsoft AD server            |
    +----------------------------------+---------+---------+----------------------------------------------------------------------+
    Note

    The LDAP domain is read-only. This means that you cannot create new user or group records in it.

  7. Once the LDAP user is granted the appropriate role, you can authenticate within the specified domain. Set the environment variables for admin-level access:

    ardana > source keystone.osrc

    Get the user record within the ad (Active Directory) domain:

    ardana > openstack user show testuser1 --domain ad

    Note the output:

    +-----------+------------------------------------------------------------------+
    | Field     | Value                                                            |
    +-----------+------------------------------------------------------------------+
    | domain_id | 143af847018c4dc7bd35390402395886                                 |
    | id        | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
    | name      | testuser1                                                        |
    +-----------+------------------------------------------------------------------+

    Now get a list of LDAP groups:

    ardana > openstack group list --domain ad

    Here you see testgroup1 and testgroup2:

    +------------------------------------------------------------------+------------+
    | ID                                                               | Name       |
    +------------------------------------------------------------------+------------+
    | 03976b0ea6f54a8e4c0032e8f756ad581f26915c7e77500c8d4aaf0e83afcdc6 | testgroup1 |
    | 7ba52ee1c5829d9837d740c08dffa07ad118ea1db2d70e0dc7fa7853e0b79fcf | testgroup2 |
    +------------------------------------------------------------------+------------+

    Create a new role. Note that the role is not bound to the domain.

    ardana > openstack role create testrole1

    The role testrole1 has been created:

    +-------+----------------------------------+
    | Field | Value                            |
    +-------+----------------------------------+
    | id    | 02251585319d459ab847409dea527dee |
    | name  | testrole1                        |
    +-------+----------------------------------+

    Grant the user a role within the domain by executing the code below. Note that due to a current OpenStack CLI limitation, you must use the user ID rather than the user name when working with a non-default domain.

    ardana > openstack role add testrole1 --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 --domain ad

    Verify that the role was successfully granted, as shown here:

    ardana > openstack role assignment list --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 --domain ad
    +----------------------------------+------------------------------------------------------------------+-------+---------+----------------------------------+
    | Role                             | User                                                             | Group | Project | Domain                           |
    +----------------------------------+------------------------------------------------------------------+-------+---------+----------------------------------+
    | 02251585319d459ab847409dea527dee | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |       |         | 143af847018c4dc7bd35390402395886 |
    +----------------------------------+------------------------------------------------------------------+-------+---------+----------------------------------+

    Authenticate (get a domain-scoped token) as the new user with the new role. The --os-* command-line parameters specified below override the respective OS_* environment variables set by the keystone.osrc script to provide admin access. To ensure that the command below is executed in a clean environment, you may want to log out from the node and log in again.

    ardana > openstack --os-identity-api-version 3 \
                --os-username testuser1 \
                --os-password testuser1_password \
                --os-auth-url http://10.0.0.6:35357/v3 \
                --os-domain-name ad \
                --os-user-domain-name ad \
                token issue

    Here is the result:

    +-----------+------------------------------------------------------------------+
    | Field     | Value                                                            |
    +-----------+------------------------------------------------------------------+
    | domain_id | 143af847018c4dc7bd35390402395886                                 |
    | expires   | 2015-09-09T21:36:15.306561Z                                      |
    | id        | 6f8f9f1a932a4d01b7ad9ab061eb0917                                 |
    | user_id   | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
    +-----------+------------------------------------------------------------------+
  8. Users can also have a project within the domain and get a project-scoped token. To accomplish this, set the environment variables for admin-level access:

    ardana > source keystone.osrc

    Then create a new project within the domain:

    ardana > openstack project create testproject1 --domain ad

    The result shows that the project has been created:

    +-------------+----------------------------------+
    | Field       | Value                            |
    +-------------+----------------------------------+
    | description |                                  |
    | domain_id   | 143af847018c4dc7bd35390402395886 |
    | enabled     | True                             |
    | id          | d065394842d34abd87167ab12759f107 |
    | name        | testproject1                     |
    +-------------+----------------------------------+

    Grant the user a role on the project, re-using the role created in the previous example. Note that due to a current OpenStack CLI limitation, you must use the user ID rather than the user name when working with a non-default domain.

    ardana > openstack role add testrole1 --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 --project testproject1

    Verify that the role was successfully granted by generating a list:

    ardana > openstack role assignment list --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 --project testproject1

    The output shows the result:

    +----------------------------------+------------------------------------------------------------------+-------+----------------------------------+--------+
    | Role                             | User                                                             | Group | Project                          | Domain |
    +----------------------------------+------------------------------------------------------------------+-------+----------------------------------+--------+
    | 02251585319d459ab847409dea527dee | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |       | d065394842d34abd87167ab12759f107 |        |
    +----------------------------------+------------------------------------------------------------------+-------+----------------------------------+--------+

    Authenticate (get a project-scoped token) as the new user with the new role. The --os-* command-line parameters specified below override their respective OS_* environment variables set by keystone.osrc to provide admin access. To ensure that the command below is executed in a clean environment, you may want to log out from the node and log in again. Note that both the --os-project-domain-name and --os-user-domain-name parameters are needed because neither the user nor the project is in the default domain.

    ardana > openstack --os-identity-api-version 3 \
                --os-username testuser1 \
                --os-password testuser1_password \
                --os-auth-url http://10.0.0.6:35357/v3 \
                --os-project-name testproject1 \
                --os-project-domain-name ad \
                --os-user-domain-name ad \
                token issue

    Below is the result:

    +------------+------------------------------------------------------------------+
    | Field      | Value                                                            |
    +------------+------------------------------------------------------------------+
    | expires    | 2015-09-09T21:50:49.945893Z                                      |
    | id         | 328e18486f69441fb13f4842423f52d1                                 |
    | project_id | d065394842d34abd87167ab12759f107                                 |
    | user_id    | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
    +------------+------------------------------------------------------------------+
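
The sample LDAP settings in step 2 rely on two conventions that are easy to get wrong: the userAccountControl bit mask that marks an Active Directory account as disabled, and the pairing of the URL scheme with use_tls. The Python sketch below illustrates both. It is an illustration under assumptions, not keystone's implementation: the function names are hypothetical, and the constants reflect the usual meaning of the Active Directory flags (2 for a disabled account, 512 for a normal account).

```python
from urllib.parse import urlparse

ACCOUNTDISABLE = 0x0002  # corresponds to user_enabled_mask: 2
NORMAL_ACCOUNT = 0x0200  # corresponds to user_enabled_default: 512


def is_enabled(user_account_control, mask=ACCOUNTDISABLE):
    """Treat an account as enabled when the masked bit is not set."""
    return (user_account_control & mask) != mask


def tls_warnings(url, use_tls):
    """Return warnings for url/use_tls pairs that the sample comments advise against."""
    scheme = urlparse(url).scheme.lower()
    warnings = []
    if scheme == "ldaps" and use_tls:
        warnings.append("ldaps:// already uses TLS; set use_tls to False")
    if scheme == "ldap" and not use_tls:
        warnings.append("ldap:// without use_tls is not encrypted")
    return warnings


if __name__ == "__main__":
    print(is_enabled(NORMAL_ACCOUNT))                    # normal, enabled account
    print(is_enabled(NORMAL_ACCOUNT | ACCOUNTDISABLE))   # same account, disabled
    print(tls_warnings("ldaps://10.16.22.5", False))
```

Both sample domains pass the check as written: the ad domain combines ldap:// with use_tls: True, and the msad domain combines ldaps:// with use_tls: False.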

5.9.3 Set up or switch to domain-specific driver configuration using a database store

To make the switch, execute the steps below. Remember, you must have already set up the configuration for a file store as explained in Section 5.9.2, “Set up domain-specific driver configuration - file store”, and it must be working properly.

  1. Ensure that the following configuration options are set in the main configuration file, ~/openstack/my_cloud/config/keystone/keystone.conf.j2:

    [identity]
    domain_specific_drivers_enabled = True
    domain_configurations_from_database = True
    
    [domain_config]
    driver = sql
  2. Once the template is modified, commit the change to the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”):

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add -A

    Verify that the files have been added using git status:

    ardana > git status

    Then commit the changes:

    ardana > git commit -m "Use Domain-Specific Driver Configuration - Database Store: more description here..."

    Next, run the configuration processor and ready deployment playbooks:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Run the reconfiguration playbook in a deployment area:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
  4. Upload the domain-specific config files to the database if they have not been loaded. If they have already been loaded and you want to switch back to database store mode, then skip this upload step and move on to step 5.

    1. Go to one of the controller nodes where keystone is deployed.

    2. Verify that the domain-specific driver configuration files are located in the domain configuration directory (by default, /etc/keystone/domains) and are named in the format keystone.<domain name>.conf. Then use the keystone manager utility to load the domain-specific configuration files into the database. There are two options for uploading the files:

      1. Option 1: Upload all configuration files to the SQL database:

        ardana > keystone-manage domain_config_upload --all
      2. Option 2: Upload individual domain-specific configuration files by specifying the domain name one by one:

        ardana > keystone-manage domain_config_upload --domain-name <domain_name>

        Here is an example:

        ardana > keystone-manage domain_config_upload --domain-name ad

        Note that the keystone manager utility does not upload the domain-specific driver configuration file the second time for the same domain. For the management of the domain-specific driver configuration in the database store, you may refer to OpenStack Identity API - Domain Configuration.

  5. Verify that the switched domain driver configuration for LDAP (Microsoft AD in this example) in the database store works properly. First, set the environment variables for admin-level access:

    ardana > source ~/keystone.osrc

    Get a list of domain users:

    ardana > openstack user list --domain ad

    Note the three users returned:

    +------------------------------------------------------------------+------------+
    | ID                                                               | Name       |
    +------------------------------------------------------------------+------------+
    | e7dbec51ecaf07906bd743debcb49157a0e8af557b860a7c1dadd454bdab03fe | testuser1  |
    | 8a09630fde3180c685e0cd663427e8638151b534a8a7ccebfcf244751d6f09bd | testuser2  |
    | ea463d778dadcefdcfd5b532ee122a70dce7e790786678961420ae007560f35e | testuser3  |
    +------------------------------------------------------------------+------------+

    Get user records within the ad domain:

    ardana > openstack user show testuser1 --domain ad

    Here testuser1 is returned:

    +-----------+------------------------------------------------------------------+
    | Field     | Value                                                            |
    +-----------+------------------------------------------------------------------+
    | domain_id | 143af847018c4dc7bd35390402395886                                 |
    | id        | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
    | name      | testuser1                                                        |
    +-----------+------------------------------------------------------------------+

    Get a list of LDAP groups:

    ardana > openstack group list --domain ad

    Note that testgroup1 and testgroup2 are returned:

    +------------------------------------------------------------------+------------+
    | ID                                                               | Name       |
    +------------------------------------------------------------------+------------+
    | 03976b0ea6f54a8e4c0032e8f756ad581f26915c7e77500c8d4aaf0e83afcdc6 | testgroup1 |
    | 7ba52ee1c5829d9837d740c08dffa07ad118ea1db2d70e0dc7fa7853e0b79fcf | testgroup2 |
    +------------------------------------------------------------------+------------+
    Note

    LDAP domain is read-only. This means that you cannot create new user or group records in it.
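The manual checks in this step could also be scripted. The following is a minimal sketch, assuming the listing was captured with the OpenStackClient JSON formatter (for example, openstack user list --domain ad -f json). The helper names and the sample data are illustrative and not part of SUSE OpenStack Cloud.

```python
import json


def ids_by_name(list_output):
    """Map each Name to its ID from `openstack ... list -f json` output."""
    return {row["Name"]: row["ID"] for row in json.loads(list_output)}


def missing_names(list_output, expected):
    """Return the expected names that are absent from the listing, sorted."""
    return sorted(set(expected) - set(ids_by_name(list_output)))


# Hypothetical captured output; the IDs are shortened for readability.
SAMPLE = '[{"ID": "e7db", "Name": "testuser1"}, {"ID": "8a09", "Name": "testuser2"}]'

if __name__ == "__main__":
    print(missing_names(SAMPLE, ["testuser1", "testuser2", "testuser3"]))
```

The same helpers work for the group listing, because it uses the same ID and Name columns.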

5.9.4 Domain-specific driver configuration. Switching from a database to a file store

Following is the procedure to switch a domain-specific driver configuration from a database store to a file store. It is assumed that:

  • The domain-specific driver configuration with a database store has been set up and is working properly.

  • Domain-specific driver configuration files named keystone.<domain name>.conf are already in place and verified in the domain configuration directory (by default, /etc/keystone/domains/) on all of the controller nodes.

  1. Ensure that the following configuration options are set in the main configuration file template in ~/openstack/my_cloud/config/keystone/keystone.conf.j2:

    [identity]
     domain_specific_drivers_enabled = True
     domain_configurations_from_database = False
    
    [domain_config]
    # driver = sql
  2. Once the template is modified, commit the change to the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested at Using Git for Configuration Management):

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add -A

    Verify that the files have been added using git status, then commit the changes:

    ardana > git status
    ardana > git commit -m "Domain-Specific Driver Configuration - Switch From Database Store to File Store: more description here..."

    Then run the configuration processor and ready deployment playbooks:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Run the reconfiguration playbook in the deployment area:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
  4. Verify that the switched domain driver configuration for LDAP (Microsoft AD in this example) using the file store works properly. First, set the environment variables for admin-level access:

    ardana > source ~/keystone.osrc

    Get a list of domain users:

    ardana > openstack user list --domain ad

    Here you see the three users:

    +------------------------------------------------------------------+------------+
    | ID                                                               | Name       |
    +------------------------------------------------------------------+------------+
    | e7dbec51ecaf07906bd743debcb49157a0e8af557b860a7c1dadd454bdab03fe | testuser1  |
    | 8a09630fde3180c685e0cd663427e8638151b534a8a7ccebfcf244751d6f09bd | testuser2  |
    | ea463d778dadcefdcfd5b532ee122a70dce7e790786678961420ae007560f35e | testuser3  |
    +------------------------------------------------------------------+------------+

    Get user records within the ad domain:

    ardana > openstack user show testuser1 --domain ad

    Here is the result:

    +-----------+------------------------------------------------------------------+
    | Field     | Value                                                            |
    +-----------+------------------------------------------------------------------+
    | domain_id | 143af847018c4dc7bd35390402395886                                 |
    | id        | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
    | name      | testuser1                                                        |
    +-----------+------------------------------------------------------------------+

    Get a list of LDAP groups:

    ardana > openstack group list --domain ad

    Here are the groups returned:

    +------------------------------------------------------------------+------------+
    | ID                                                               | Name       |
    +------------------------------------------------------------------+------------+
    | 03976b0ea6f54a8e4c0032e8f756ad581f26915c7e77500c8d4aaf0e83afcdc6 | testgroup1 |
    | 7ba52ee1c5829d9837d740c08dffa07ad118ea1db2d70e0dc7fa7853e0b79fcf | testgroup2 |
    +------------------------------------------------------------------+------------+

    Note: The LDAP domain is read-only. This means that you cannot create new user or group records in it.
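The [identity] settings from step 1 decide which store the identity service reads. The following is a rough sketch that classifies a configuration snippet as file store, database store, or disabled. It assumes the text is a rendered keystone.conf (or a plain snippet of it), because Jinja2 expressions in the .j2 template may not parse as INI. The function name is hypothetical.

```python
import configparser


def domain_config_store(conf_text):
    """Classify which domain-specific driver store a keystone.conf selects."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    enabled = parser.getboolean(
        "identity", "domain_specific_drivers_enabled", fallback=False)
    from_database = parser.getboolean(
        "identity", "domain_configurations_from_database", fallback=False)
    if not enabled:
        return "disabled"
    return "database" if from_database else "file"


# Snippet equivalent to the settings shown in step 1.
FILE_STORE_SNIPPET = """
[identity]
domain_specific_drivers_enabled = True
domain_configurations_from_database = False

[domain_config]
# driver = sql
"""

if __name__ == "__main__":
    print(domain_config_store(FILE_STORE_SNIPPET))
```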

5.9.5 Update LDAP CA certificates

LDAP CA certificates can expire or otherwise stop working. Follow the steps below to update the LDAP CA certificates on the identity service side.

  1. Locate the file keystone_configure_ldap_certs_sample.yml:

    ~/openstack/my_cloud/config/keystone/keystone_configure_ldap_certs_sample.yml
  2. Save a copy of this file with a new name, for example:

    ~/openstack/my_cloud/config/keystone/keystone_configure_ldap_certs_all.yml
  3. Edit the file and specify the single file path for the LDAP CA certificates. This path must be consistent with the one defined in tls_cacertfile of the domain-specific configuration. Then populate or update the file with the LDAP CA certificates for all LDAP domains.

  4. As suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”, add the new file to the local git repository:

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add -A

    Verify that the files have been added using git status and commit the file:

    ardana > git status
    ardana > git commit -m "Update LDAP CA certificates: more description here..."

    Then run the configuration processor and ready deployment playbooks:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  5. Run the reconfiguration playbook in the deployment area:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@~/openstack/my_cloud/config/keystone/keystone_configure_ldap_certs_all.yml
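Because a single CA file has to carry the certificates for every LDAP domain, it can help to count the PEM blocks in the combined text before running the playbook. This is a minimal sketch that only counts BEGIN/END CERTIFICATE markers; it does not validate or decode the certificates, and the function name is an assumption made for illustration.

```python
import re

# Matches one PEM certificate block, including its Base64 body.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.+?)-----END CERTIFICATE-----",
    re.DOTALL,
)


def count_certificates(bundle_text):
    """Return how many PEM certificate blocks appear in the text."""
    return len(PEM_BLOCK.findall(bundle_text))


# Placeholder bundle with two dummy blocks (not real certificates).
SAMPLE_BUNDLE = (
    "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n"
    "-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n"
)

if __name__ == "__main__":
    print(count_certificates(SAMPLE_BUNDLE))
```

If the count is lower than the number of LDAP domains that use their own CA, a certificate is probably missing from the file.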

5.9.6 Limitations

SUSE OpenStack Cloud 9 domain-specific configuration:

  • No Global User Listing: Once domain-specific driver configuration is enabled, listing all users and listing all groups are not supported operations. Those calls require a specific domain filter and a domain-scoped token for the target domain.

  • You cannot have both a file store and a database store for domain-specific driver configuration in a single identity service instance. Once a database store is enabled within the identity service instance, any file store will be ignored, and vice versa.

  • The identity service allows a list limit configuration to globally set the maximum number of entities that will be returned in an identity collection per request but it does not support per-domain list limit setting at this time.

  • Each time a new domain is configured with LDAP integration the single CA file gets overwritten. Ensure that you place certs for all the LDAP back-end domains in the cacert parameter. Detailed CA file inclusion instructions are provided in the comments of the sample YAML configuration file keystone_configure_ldap_my.yml (Section 5.9.2, “Set up domain-specific driver configuration - file store”).

  • LDAP is only supported for identity operations (reading users and groups from LDAP).

  • keystone assignment operations from LDAP records, such as managing or assigning roles and projects, are not currently supported.

  • The SUSE OpenStack Cloud 'default' domain is pre-configured to store service account users and is authenticated locally against the identity service. Domains configured for external LDAP integration are non-default domains.

  • When using the current OpenStackClient CLI you must use the user ID rather than the user name when working with a non-default domain.

  • Each LDAP connection with the identity service is for read-only operations. Configurations that require identity service write operations (to create users, groups, etc.) are not currently supported.

SUSE OpenStack Cloud 9 API-based domain-specific configuration management

  • No GUI dashboard for domain-specific driver configuration management

  • API-based domain-specific configuration does not check the type of an option.

  • API-based domain-specific configuration does not check whether an option value is supported.

  • The API-based domain configuration method does not provide retrieval of the default values of domain-specific configuration options.

  • Status: Domain-specific driver configuration database store is a non-core feature for SUSE OpenStack Cloud 9.

Note

When integrating with an external identity provider, cloud security is dependent upon the security of that identity provider. You should examine the security of the identity provider, and in particular the SAML 2.0 token generation process, and decide what security properties you need to ensure adequate security of your cloud deployment. More information about SAML can be found at https://www.owasp.org/index.php/SAML_Security_Cheat_Sheet.

5.10 keystone-to-keystone Federation

This topic explains how you can use one instance of keystone as an identity provider and one as a service provider.

5.10.1 What Is Keystone-to-Keystone Federation?

Identity federation lets you configure SUSE OpenStack Cloud using existing identity management systems such as an LDAP directory as the source of user access authentication. The keystone-to-keystone federation (K2K) function extends this concept for accessing resources in multiple, separate SUSE OpenStack Cloud clouds. You can configure each cloud to trust the authentication credentials of other clouds to provide the ability for users to authenticate with their home cloud and to access authorized resources in another cloud without having to reauthenticate with the remote cloud. This function is sometimes referred to as "single sign-on" or SSO.

The SUSE OpenStack Cloud cloud that provides the initial user authentication is called the identity provider (IdP). The identity provider cloud can support domain-based authentication against external authentication sources including LDAP-based directories such as Microsoft Active Directory. The identity provider creates the user attributes, known as assertions, which are used to automatically authenticate users with other SUSE OpenStack Cloud clouds.

A SUSE OpenStack Cloud cloud that provides resources is called a service provider (SP). A service provider cloud accepts user authentication assertions from the identity provider and provides access to project resources based on the mapping file settings developed for each service provider cloud. The following are characteristics of a service provider:

  • Each service provider cloud has a unique set of projects, groups, and group role assignments that are created and managed locally.

  • The mapping file consists of a set of rules that define user group membership.

  • The mapping file lets you auto-assign incoming users to a specific group. Project membership and access are defined by group membership.

  • Project quotas are defined locally by each service provider cloud.

keystone-to-keystone federation is supported and enabled in SUSE OpenStack Cloud 9 using configuration parameters in specific Ansible files. Instructions are provided to define and enable the required configurations.

Support for keystone-to-keystone federation happens on the API level, and you must implement it using your own client code by calling the supported APIs. The python-keystoneclient library provides APIs for accessing the K2K APIs.

Example 5.1: k2kclient.py

The following k2kclient.py file is an example, and the request diagram Figure 5.1, “Keystone Authentication Flow” explains the flow of client requests.

import json
import os
import requests

import xml.dom.minidom

from keystoneclient.auth.identity import v3
from keystoneclient import session

class K2KClient(object):

    def __init__(self):
        # IdP auth URL
        self.auth_url = "http://192.168.245.9:35357/v3/"
        self.project_name = "admin"
        self.project_domain_name = "Default"
        self.username = "admin"
        self.password = "vvaQIZ1S"
        self.user_domain_name = "Default"
        self.session = requests.Session()
        self.verify = False
        # identity provider Id
        self.idp_id = "z420_idp"
        # service provider Id
        self.sp_id = "z620_sp"
        #self.sp_ecp_url = "https://16.103.149.44:8443/Shibboleth.sso/SAML2/ECP"
        #self.sp_auth_url = "https://16.103.149.44:8443/v3"

    def v3_authenticate(self):
        auth = v3.Password(auth_url=self.auth_url,
                           username=self.username,
                           password=self.password,
                           user_domain_name=self.user_domain_name,
                           project_name=self.project_name,
                           project_domain_name=self.project_domain_name)

        self.auth_session = session.Session(session=requests.session(),
                                       auth=auth, verify=self.verify)
        auth_ref = self.auth_session.auth.get_auth_ref(self.auth_session)
        self.token = self.auth_session.auth.get_token(self.auth_session)

    def _generate_token_json(self):
        return {
            "auth": {
                "identity": {
                    "methods": [
                        "token"
                    ],
                    "token": {
                        "id": self.token
                    }
                },
                "scope": {
                    "service_provider": {
                        "id": self.sp_id
                    }
                }
            }
        }

    def get_saml2_ecp_assertion(self):
        token = json.dumps(self._generate_token_json())
        url = self.auth_url + 'auth/OS-FEDERATION/saml2/ecp'
        r = self.session.post(url=url,
                              data=token,
                              verify=self.verify)
        if not r.ok:
            raise Exception("Something went wrong, %s" % r.__dict__)
        self.ecp_assertion = r.text

    def _get_sp_url(self):
        url = self.auth_url + 'OS-FEDERATION/service_providers/' + self.sp_id
        r = self.auth_session.get(
           url=url,
           verify=self.verify)
        if not r.ok:
            raise Exception("Something went wrong, %s" % r.__dict__)

        sp = json.loads(r.text)[u'service_provider']
        self.sp_ecp_url = sp[u'sp_url']
        self.sp_auth_url = sp[u'auth_url']

    def _handle_http_302_ecp_redirect(self, response, method, **kwargs):
        location = self.sp_auth_url + '/OS-FEDERATION/identity_providers/' + self.idp_id + '/protocols/saml2/auth'
        return self.auth_session.request(location, method, authenticated=False, **kwargs)

    def exchange_assertion(self):
        """Send assertion to a keystone SP and get token."""
        self._get_sp_url()
        print("SP ECP Url:%s" % self.sp_ecp_url)
        print("SP Auth Url:%s" % self.sp_auth_url)
        #self.sp_ecp_url = 'https://16.103.149.44:8443/Shibboleth.sso/SAML2/ECP'
        r = self.auth_session.post(
            self.sp_ecp_url,
            headers={'Content-Type': 'application/vnd.paos+xml'},
            data=self.ecp_assertion,
            authenticated=False, redirect=False)
        r = self._handle_http_302_ecp_redirect(r, 'GET',
            headers={'Content-Type': 'application/vnd.paos+xml'})
        self.fed_token_id = r.headers['X-Subject-Token']
        self.fed_token = r.text

if __name__ == "__main__":
    client = K2KClient()
    client.v3_authenticate()
    client.get_saml2_ecp_assertion()
    client.exchange_assertion()
    print('Unscoped token_id: %s' % client.fed_token_id)
    print('Unscoped token body:\n%s' % client.fed_token)
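The example above hard-codes the credentials in __init__. As an alternative, the values could be read from the environment. The following sketch assumes that the conventional OpenStack client variables (OS_AUTH_URL, OS_USERNAME, and so on) have been exported, for example by sourcing an osrc file. The function name and the mapping table are illustrative, and idp_id and sp_id would still be set separately.

```python
import os

# Conventional OpenStack client variables mapped to K2KClient attribute names.
ENV_TO_ATTR = {
    "OS_AUTH_URL": "auth_url",
    "OS_PROJECT_NAME": "project_name",
    "OS_PROJECT_DOMAIN_NAME": "project_domain_name",
    "OS_USERNAME": "username",
    "OS_PASSWORD": "password",
    "OS_USER_DOMAIN_NAME": "user_domain_name",
}


def settings_from_env(environ=None):
    """Collect client settings from environment variables, failing early."""
    environ = os.environ if environ is None else environ
    missing = sorted(name for name in ENV_TO_ATTR if name not in environ)
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {attr: environ[name] for name, attr in ENV_TO_ATTR.items()}
```

A caller could then copy the returned values onto the client, for example with setattr(client, name, value) for each item, which keeps the password out of the script.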

5.10.2 Setting Up a keystone Provider

To set up keystone as a service provider, follow these steps.

  1. Create a config file called k2k.yml with the following parameters and place it in any directory on your Cloud Lifecycle Manager, such as /tmp.

    keystone_trusted_idp: k2k
    keystone_sp_conf:
      shib_sso_idp_entity_id: <protocol>://<idp_host>:<port>/v3/OS-FEDERATION/saml2/idp
      shib_sso_application_entity_id: http://service_provider_uri_entityId
      target_domain:
        name: domain1
        description: my domain
      target_project:
        name: project1
        description: my project
      target_group:
        name: group1
        description: my group
      role:
        name: service
      idp_metadata_file: /tmp/idp_metadata.xml
      identity_provider:
        id: my_idp_id
        description: This is the identity service provider.
      mapping:
        id: mapping1
        rules_file: /tmp/k2k_sp_mapping.json
      protocol:
        id: saml2
      attribute_map:
        -
          name: name1
          id: id1

    The following are descriptions of each of the attributes.

    Attribute: Definition
    keystone_trusted_idp

    A flag to indicate if this configuration is used for keystone-to-keystone or WebSSO. The value can be either k2k or adfs.

    keystone_sp_conf

    The parent key for the service provider settings that follow.

    shib_sso_idp_entity_id

    The identity provider URI used as an entity Id to identify the IdP. You should use the following value: <protocol>://<idp_host>:<port>/v3/OS-FEDERATION/saml2/idp.

    shib_sso_application_entity_id

    The service provider URI used as an entity Id. It can be any URI here for keystone-to-keystone.

    target_domain

    A domain where the group will be created.

    name

    Any domain name. If it does not exist, it will be created or updated.

    description

    Any description.

    target_project

    The project that the group is scoped to.

    name

    Any project name. If it does not exist, it will be created or updated.

    description

    Any description.

    target_group

    A group that will be created in target_domain.

    name

    Any group name. If it does not exist, it will be created or updated.

    description

    Any description.

    role

    A role will be assigned on target_project. This role impacts the IdP user scoped token permission on the service provider side.

    name

    Must be an existing role.

    idp_metadata_file

    A reference to the IdP metadata file that validates the SAML2 assertion.

    identity_provider

    A supported IdP.

    id

    Any Id. If it does not exist, it will be created or updated. This Id needs to be shared with the client so that the right mapping will be selected.

    description

    Any description.

    mapping

    A mapping in JSON format that maps a federated user to a corresponding group.

    id

    Any Id. If it does not exist, it will be created or updated.

    rules_file

    A reference to the file that has the mapping in JSON.

    protocol

    The supported federation protocol.

    id

    Security Assertion Markup Language 2.0 (SAML2) is the only supported protocol for K2K.

    attribute_map

    A Shibboleth attribute mapping that maps additional attributes from the SAML2 assertion to the names used in the K2K mapping that the service provider understands. K2K does not require any additional attribute mapping.

    name

    An attribute name from the SAML2 assertion.

    id

    An Id that the preceding name will be mapped to.
  2. Create a metadata file that is referenced from k2k.yml, such as /tmp/idp_metadata.xml. The content of the metadata file comes from the identity provider and can be found in /etc/keystone/idp_metadata.xml.

    1. Create a mapping file that is referenced in k2k.yml, shown previously. An example is /tmp/k2k_sp_mapping.json. It is the value of rules_file under mapping in the preceding k2k.yml example. The following is an example of the mapping file.

      [
        {
          "local": [
            {
              "user": {
                "name": "{0}"
              }
            },
            {
              "group": {
                 "name": "group1",
                 "domain":{
                   "name": "domain1"
                 }
              }
            }
          ],
          "remote":[{
            "type": "openstack_user"
          },
          {
            "type": "Shib-Identity-Provider",
            "any_one_of":[
               "https://idp_host:5000/v3/OS-FEDERATION/saml2/idp"
            ]
           }
          ]
         }
      ]

      You can find more information on how the K2K mapping works at http://docs.openstack.org.

  3. Go to ~/scratch/ansible/next/ardana/ansible and run the following playbook to enable the service provider:

    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@/tmp/k2k.yml
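Before running the playbook, the mapping file referenced by rules_file can be given a quick structural check. The following sketch only verifies that the text is valid JSON and that every rule carries "local" and "remote" lists, as in the example mapping above. It is not the validation that keystone itself performs, and the function name is hypothetical.

```python
import json


def mapping_problems(text):
    """Return a list of structural problems found in a mapping file's text."""
    try:
        rules = json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return ["not valid JSON: %s" % exc]
    if not isinstance(rules, list):
        return ["top level must be a list of rules"]
    problems = []
    for index, rule in enumerate(rules):
        for key in ("local", "remote"):
            if not isinstance(rule, dict) or not isinstance(rule.get(key), list):
                problems.append("rule %d: '%s' must be a list" % (index, key))
    return problems


# A one-rule mapping in the same shape as the example mapping file.
GOOD_MAPPING = """
[{"local": [{"user": {"name": "{0}"}},
            {"group": {"name": "group1", "domain": {"name": "domain1"}}}],
  "remote": [{"type": "openstack_user"}]}]
"""

if __name__ == "__main__":
    print(mapping_problems(GOOD_MAPPING))
```

A missing comma between two rules is reported as invalid JSON, which is an easy mistake to make when rules are copied by hand.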

Setting Up an Identity Provider

To set up keystone as an identity provider, follow these steps:

  1. Create a config file k2k.yml with the following parameters and place it in any directory on your Cloud Lifecycle Manager, such as /tmp. Note that the certificate and key here are excerpted for space.

    keystone_k2k_idp_conf:
        service_provider:
              -
                id: my_sp_id
                description: This is service provider.
                sp_url: https://sp_host:5000
                auth_url: https://sp_host:5000/v3
        signer_cert: -----BEGIN CERTIFICATE-----
    MIIDmDCCAoACCQDS+ZDoUfr
        cIzANBgkqhkiG9w0BAQsFADCBjDELMAkGA1UEBhMC\ nVVMxEzARBgNVB
        AgMCkNhbGlmb3JuaWExEjAQBgNVBAcMCVN1bm55dmFsZTEMMAoG\
       
                ...
        nOpKEvhlMsl5I/tle
    -----END CERTIFICATE-----
        signer_key: -----BEGIN RSA PRIVATE KEY-----
    MIIEowIBAAKCAQEA1gRiHiwSO6L5PrtroHi/f17DQBOpJ1KMnS9FOHS
                
                ...

    The following are descriptions of each of the attributes under keystone_k2k_idp_conf:

    service_provider

    One or more service providers can be defined. Each one is created if it does not exist, or updated if it does.

    id

    Any Id. If it does not exist, it will be created or updated. This Id needs to be shared with the client so that it knows where the service provider is.

    description

    Any description.

    sp_url

    Service provider base URL.

    auth_url

    Service provider auth URL.

    signer_cert

    Content of self-signed certificate that is embedded in the metadata file. We recommend setting the validity for a longer period of time, such as 3650 days (10 years).

    signer_key

    A private key that has a key size of 2048 bits.

  2. Create a private key and a self-signed certificate. The command-line tool, openssl, is required to generate the keys and certificates. If the system does not have it, you must install it.

    1. Create a private key of size 2048.

      ardana > openssl genrsa -out myidp.key 2048
    2. Generate a certificate request named myidp.csr. When prompted for the Common Name, enter the server's host name.

      ardana > openssl req -new -key myidp.key -out myidp.csr
    3. Generate a self-signed certificate named myidp.cer.

      ardana > openssl x509 -req -days 3650 -in myidp.csr -signkey myidp.key -out myidp.cer
  3. Go to ~/scratch/ansible/next/ardana/ansible and run the following playbook to enable the service provider in keystone:

    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@/tmp/k2k.yml

5.10.3 Test It Out

You can use the script listed earlier, k2kclient.py (Example 5.1, “k2kclient.py”), as an example for the end-to-end flows. To run k2kclient.py, follow these steps:

  1. A few parameters must be changed at the beginning of k2kclient.py. For example, enter your specific URL, project name, and user name, as follows:

    # IdP auth URL
    self.auth_url = "http://idp_host:5000/v3/"
    self.project_name = "my_project_name"
    self.project_domain_name = "my_project_domain_name"
    self.username = "test"
    self.password = "mypass"
    self.user_domain_name = "my_domain"
    # identity provider Id that is defined in the SP config
    self.idp_id = "my_idp_id"
    # service provider Id that is defined in the IdP config
    self.sp_id = "my_sp_id"
  2. Install python-keystoneclient along with its dependencies.

  3. Run the k2kclient.py script. An unscoped token will be returned from the service provider.

At this point, the domain or project scope of the unscoped token can be discovered by sending the following requests:

ardana > curl -k -X GET -H "X-Auth-Token: unscoped token" \
 https://<sp_public_endpoint>:5000/v3/OS-FEDERATION/domains
ardana > curl -k -X GET -H "X-Auth-Token: unscoped token" \
 https://<sp_public_endpoint>:5000/v3/OS-FEDERATION/projects
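The responses to these requests are JSON documents. As a small sketch, assuming the body of the projects request carries a top-level "projects" list whose entries have "id" and "name" fields, the available scopes could be extracted as follows. The function name and the sample body are illustrative.

```python
import json


def project_choices(response_text):
    """Return (id, name) pairs for each project listed in the response body."""
    body = json.loads(response_text)
    return [(project["id"], project["name"])
            for project in body.get("projects", [])]


# Hypothetical response body, trimmed to the fields used here.
SAMPLE_RESPONSE = '{"projects": [{"id": "abc123", "name": "project1"}], "links": {}}'

if __name__ == "__main__":
    for project_id, name in project_choices(SAMPLE_RESPONSE):
        print(project_id, name)
```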

5.10.4 Inside keystone-to-keystone Federation

K2K federation places a lot of responsibility on the user. The complexity is apparent from the following steps.

  1. Users must first authenticate to their home (local) cloud, that is, the local identity provider keystone instance, to obtain a scoped token.

  2. Users must discover which service providers (or remote clouds) are available to them by querying their local cloud.

  3. For a given remote cloud, users must discover which resources are available to them by querying the remote cloud for the projects they can scope to.

  4. To talk to the remote cloud, users must first exchange, with the local cloud, their locally scoped token for a SAML2 assertion to present to the remote cloud.

  5. Users then present the SAML2 assertion to the remote cloud. The remote cloud applies its mapping for the incoming SAML2 assertion to map each user to a local ephemeral persona (such as groups) and issues an unscoped token.

  6. Users query the remote cloud for the list of projects they have access to.

  7. Users then rescope their token to a given project.

  8. Users now have access to the resources owned by the project.

The following diagram illustrates the flow of authentication requests.

Figure 5.1: Keystone Authentication Flow
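Step 7 of the flow, rescoping the unscoped token to a project, is not covered by k2kclient.py. The following sketch builds the request body for that step, mirroring the shape of _generate_token_json in the example. It assumes the body is posted as JSON to /v3/auth/tokens on the service provider, which returns the scoped token in the X-Subject-Token response header. The function name is hypothetical.

```python
def project_scoped_auth(unscoped_token_id, project_id):
    """Build a token-method auth request that scopes to one project."""
    return {
        "auth": {
            "identity": {
                "methods": ["token"],
                "token": {"id": unscoped_token_id},
            },
            "scope": {
                "project": {"id": project_id},
            },
        }
    }


if __name__ == "__main__":
    import json
    # Placeholder values; use the unscoped token and a project ID
    # returned by the service provider.
    print(json.dumps(project_scoped_auth("unscoped-token-id", "project-id"),
                     indent=2))
```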

5.10.5 Additional Testing Scenarios

The following tests assume one identity provider and one service provider.

Test Case 1: Any federated user in the identity provider maps to a single designated group in the service provider

  1. On the identity provider side:

    hostname=myidp.com
    username=user1
  2. On the service provider side:

    group=group1
    group_domain_name=domain1
    'group1' scopes to 'project1'
  3. Mapping used:

    testcase1_1.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group1",
               "domain":{
                 "name": "domain1"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result: The federated user will scope to project1.
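The expected result can be reasoned about with a small model of rule matching. The following sketch is a simplified illustration and not keystone's mapping engine: it assumes that assertion attributes arrive as lists of strings, that a remote entry with only a "type" requires the attribute to be present, and that any_one_of and not_any_of test membership. All names are hypothetical.

```python
def remote_matches(remote, assertion):
    """Check one remote condition against the assertion attributes."""
    values = assertion.get(remote["type"])
    if not values:
        return False  # the attribute must be present
    if "any_one_of" in remote:
        return any(value in remote["any_one_of"] for value in values)
    if "not_any_of" in remote:
        return not any(value in remote["not_any_of"] for value in values)
    return True


def matching_groups(rules, assertion):
    """Return (group, domain) pairs for rules whose remote conditions all hold."""
    groups = []
    for rule in rules:
        if all(remote_matches(remote, assertion) for remote in rule["remote"]):
            for local in rule["local"]:
                if "group" in local:
                    group = local["group"]
                    groups.append((group["name"], group["domain"]["name"]))
    return groups


IDP = "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"

# The rule from testcase1_1.json, written as Python data.
TESTCASE1_RULES = [{
    "local": [
        {"user": {"name": "{0}"}},
        {"group": {"name": "group1", "domain": {"name": "domain1"}}},
    ],
    "remote": [
        {"type": "openstack_user"},
        {"type": "Shib-Identity-Provider", "any_one_of": [IDP]},
    ],
}]

if __name__ == "__main__":
    assertion = {"openstack_user": ["user1"], "Shib-Identity-Provider": [IDP]}
    print(matching_groups(TESTCASE1_RULES, assertion))
```

Under this model, any user arriving from the listed identity provider lands in group1 of domain1, which is what gives the user access to project1.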

Test Case 2: A federated user in a specific domain in the identity provider maps to two different groups in the service provider

  1. On the identity provider side:

    hostname=myidp.com
    username=user1
    user_domain_name=Default
  2. On the service provider side:

    group=group1
    group_domain_name=domain1
    'group1' scopes to 'project1'

    group=group2
    group_domain_name=domain2
    'group2' scopes to 'project2'
  3. Mapping used:

    testcase1_2.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group1",
               "domain":{
                 "name": "domain1"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       },
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group2",
               "domain":{
                 "name": "domain2"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "openstack_user_domain",
          "any_one_of": [
              "Default"
          ]
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result: The federated user will scope to both project1 and project2.

Test Case 3: A federated user with a specific project in the identity provider maps to a specific group in the service provider

  1. On the identity provider side:

    hostname=myidp.com
    username=user4
    user_project_name=test1
  2. On the service provider side:

    group=group4
    group_domain_name=domain4
    'group4' scopes to 'project4'
  3. Mapping used:

    testcase1_3.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group4",
               "domain":{
                 "name": "domain4"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "openstack_project",
          "any_one_of": [
              "test1"
          ]
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       },
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group5",
               "domain":{
                 "name": "domain5"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "openstack_roles",
          "not_any_of": [
              "member"
          ]
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result: The federated user will scope to project4.

Test Case 4: A federated user with a specific role in the identity provider maps to a specific group in the service provider

  1. On the identity provider side:

    hostname=myidp.com, username=user5, role_name=member
  2. On the service provider side:

    group=group5, group_domain_name=domain5, 'group5' scopes to 'project5'
  3. Mapping used:

    testcase1_3.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group4",
               "domain":{
                 "name": "domain4"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "openstack_project",
          "any_one_of": [
              "test1"
          ]
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       },
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group5",
               "domain":{
                 "name": "domain5"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "openstack_roles",
          "not_any_of": [
              "member"
          ]
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result: The federated user will scope to project5.

Test Case 5: Retain the previous scope for a federated user

  1. On the identity provider side:

    hostname=myidp.com, username=user1, user_domain_name=Default
  2. On the service provider side:

    group=group1, group_domain_name=domain1, 'group1' scopes to 'project1'
  3. Mapping used:

    testcase1_1.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group1",
               "domain":{
                 "name": "domain1"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result: The federated user will scope to project1. Later, we would like to scope federated users who have the default domain in the identity provider to project2 in addition to project1.

  5. On the identity provider side:

    hostname=myidp.com, username=user1, user_domain_name=Default
  6. On the service provider side:

    group=group1
    group_domain_name=domain1
    'group1' scopes to 'project1'
    group=group2
    group_domain_name=domain2
    'group2' scopes to 'project2'
  7. Mapping used:

    testcase1_2.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group1",
               "domain":{
                 "name": "domain1"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       },
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group2",
               "domain":{
                 "name": "domain2"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "openstack_user_domain",
          "any_one_of": [
              "Default"
          ]
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  8. Expected result: The federated user will scope to project1 and project2.
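Hand-edited mapping files with several rules are easy to break, for example by omitting a comma between two rule objects. The following is a minimal sketch of a pre-flight check, using only the Python standard library. The function name `check_mapping` and the structural checks it performs are illustrative assumptions, not part of keystone or the SUSE OpenStack Cloud playbooks.

```python
import json


def check_mapping(text):
    """Parse a mapping document and run basic structural checks.

    Returns a list of problem descriptions; an empty list means that
    no problem was found by these (deliberately simple) checks.
    """
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(rules, list):
        return ["top-level value must be a list of rules"]
    problems = []
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {index}: must be an object")
            continue
        for key in ("local", "remote"):
            if not isinstance(rule.get(key), list) or not rule[key]:
                problems.append(f"rule {index}: '{key}' must be a non-empty list")
    return problems


GOOD = '[{"local": [{"user": {"name": "{0}"}}], "remote": [{"type": "openstack_user"}]}]'
# Two rule objects with no comma between them.
BAD = '[{"local": [], "remote": []} {"local": [], "remote": []}]'

print(check_mapping(GOOD))
print(check_mapping(BAD))
```

Running a check like this before loading the mapping gives an error location in the file rather than a failure later in the configuration run.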

Test Case 6: Scope a federated user to a domain

  1. On the identity provider side:

    hostname=myidp.com, username=user1
  2. On the service provider side:

    group=group1, group_domain_name=domain1, 'group1' scopes to 'project1'
  3. Mapping used:

    testcase1_1.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group1",
               "domain":{
                 "name": "domain1"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result:

    • The federated user will scope to project1.

    • User uses CLI/Curl to assign any existing role to group1 on domain1.

    • User uses CLI/Curl to remove project1 scope from group1.

  5. Final result: The federated user will scope to domain1.

Test Case 7: Test five remote attributes for mapping

  1. Test all five different remote attributes, as follows, with similar test cases as noted previously.

    • openstack_user

    • openstack_user_domain

    • openstack_roles

    • openstack_project

    • openstack_project_domain

    The attribute openstack_user does not make much sense for testing because it is mapped only to a specific username. The preceding test cases have already covered the attributes openstack_user_domain, openstack_roles, and openstack_project.

Note that similar tests have also been run for two identity providers with one service provider, and for one identity provider with two service providers.

5.10.6 Known Issues and Limitations

Keep the following points in mind:

  • When a user is disabled in the identity provider, a federated token already issued by the service provider remains valid until it expires according to the keystone expiration setting.

  • An already issued federated token will retain its scope until its expiration. Any changes in the mapping on the service provider will not impact the scope of an already issued federated token. For example, if an already issued federated token was mapped to group1 that has scope on project1, and the mapping is changed to group2 that has scope on project2, the previously issued federated token still has scope on project1.

  • Access to service provider resources is provided only through the python-keystone CLI client or the keystone API. No horizon web interface support is currently available.

  • Domains, projects, groups, roles, and quotas are created per the service provider cloud. Support for federated projects, groups, roles, and quotas is currently not available.

  • keystone-to-keystone federation and WebSSO cannot be configured by putting both sets of configuration attributes in the same config file; they will overwrite each other. Consequently, they need to be configured individually.

  • Scoping the federated user to a domain is not supported by default in the playbook. Please follow the steps at Section 5.10.7, “Scope Federated User to Domain”.

5.10.7 Scope Federated User to Domain

Use the following steps to scope a federated user to a domain:

  1. On the IdP side, set hostname=myidp.com and username=user1.

  2. On the service provider side, set: group=group1, group_domain_name=domain1, group1 scopes to project1.

  3. Mapping used: testcase1_1.json.

    testcase1_1.json

    [
      {
        "local": [
          {
            "user": {
              "name": "{0}"
            }
          },
          {
            "group": {
               "name": "group1",
               "domain":{
                 "name": "domain1"
               }
            }
          }
        ],
        "remote":[{
          "type": "openstack_user"
        },
        {
          "type": "Shib-Identity-Provider",
          "any_one_of":[
             "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
          ]
         }
        ]
       }
    ]
  4. Expected result: The federated user will scope to project1. Use CLI/Curl to assign any existing role to group1 on domain1. Use CLI/Curl to remove project1 scope from group1.

  5. Result: The federated user will scope to domain1.
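The two CLI/Curl actions in step 4 correspond to role-assignment calls in the keystone v3 API. The sketch below only builds the HTTP method and URL for each call; it does not send anything. The endpoint and the IDs are hypothetical placeholders, and a real request would also need an `X-Auth-Token` header carrying an admin token.

```python
def domain_grant_request(endpoint, domain_id, group_id, role_id):
    """Method and URL that assign a role to a group on a domain."""
    base = endpoint.rstrip("/")
    return ("PUT",
            f"{base}/v3/domains/{domain_id}/groups/{group_id}/roles/{role_id}")


def project_revoke_request(endpoint, project_id, group_id, role_id):
    """Method and URL that remove a group's role on a project."""
    base = endpoint.rstrip("/")
    return ("DELETE",
            f"{base}/v3/projects/{project_id}/groups/{group_id}/roles/{role_id}")


# Hypothetical endpoint and IDs, for illustration only.
ENDPOINT = "https://keystone.example.com:5000"
print(domain_grant_request(ENDPOINT, "DOMAIN1_ID", "GROUP1_ID", "ROLE_ID"))
print(project_revoke_request(ENDPOINT, "PROJECT1_ID", "GROUP1_ID", "ROLE_ID"))
```

After the grant on domain1 is in place and the project1 assignment is removed, group1 has a role only on the domain, which is why the federated user then scopes to domain1.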

5.11 Configuring Web Single Sign-On

Important

The external-name in ~/openstack/my_cloud/definition/data/network_groups.yml must be set to a valid DNS-resolvable FQDN.

This topic explains how to implement web single sign-on.

5.11.1 What is WebSSO?

WebSSO, or web single sign-on, is a method for web browsers to receive current authentication information from an identity provider system without requiring a user to log in again to the application displayed by the browser. Users initially access the identity provider web page and supply their credentials. If the user successfully authenticates with the identity provider, the authentication credentials are then stored in the user’s web browser and automatically provided to all web-based applications, such as the horizon dashboard in SUSE OpenStack Cloud 9. If users have not yet authenticated with an identity provider or their credentials have timed out, they are automatically redirected to the identity provider to renew their credentials.

5.11.2 Limitations

  • The WebSSO function supports only horizon web authentication. It is not supported for direct API or CLI access.

  • WebSSO works only with the Fernet token provider. See Section 5.8.4, “Fernet Tokens”.

  • The SUSE OpenStack Cloud WebSSO function was tested with Microsoft Active Directory Federation Services (ADFS). The instructions provided are specific to ADFS and are intended as a sample configuration for deploying WebSSO with an external identity provider. If you have a different identity provider, such as Ping Identity or IBM Tivoli, consult those vendors for instructions specific to their products.

  • The SUSE OpenStack Cloud WebSSO function with OpenID method was tested with Google OAuth 2.0 APIs, which conform to the OpenID Connect specification. The interaction between Keystone and the external Identity Provider (IdP) is handled by the Apache2 auth_openidc module. Please consult with the specific OpenID Connect vendor on whether they support auth_openidc.

  • Both SAML and OpenID methods are supported for WebSSO federation in SUSE OpenStack Cloud 9.

  • WebSSO has a change password option in User Settings, but note that this function is not accessible for users authenticating with external systems such as LDAP or SAML Identity Providers.

5.11.3 Enabling WebSSO

SUSE OpenStack Cloud 9 provides WebSSO support for the horizon web interface. This support requires several configuration steps including editing the horizon configuration file as well as ensuring that the correct keystone authentication configuration is enabled to receive the authentication assertions provided by the identity provider.

WebSSO supports both the SAML and OpenID methods. The following workflow shows how horizon and keystone support WebSSO via the SAML method when no current authentication assertion is available.

  1. horizon redirects the web browser to the keystone endpoint.

  2. keystone automatically redirects the web browser to the correct identity provider authentication web page based on the keystone configuration file.

  3. The user authenticates with the identity provider.

  4. The identity provider automatically redirects the web browser back to the keystone endpoint.

  5. keystone generates the required JavaScript code to POST a token back to horizon.

  6. keystone automatically redirects the web browser back to horizon and the user can then access projects and resources assigned to the user.

The following diagram provides more details on the WebSSO authentication workflow.

[Figure: WebSSO authentication workflow]

Note that the horizon dashboard service never talks directly to the keystone identity service until the end of the sequence, after the federated unscoped token negotiation has completed. The browser interacts with the horizon dashboard service, the keystone identity service, and ADFS on their respective public endpoints.

The following sequence of events is depicted in the diagram.

  1. The user's browser reaches the horizon dashboard service's login page. The user selects ADFS login from the drop-down menu.

  2. The horizon dashboard service issues an HTTP Redirect (301) to redirect the browser to the keystone identity service's (public) SAML2 Web SSO endpoint (/auth/OS-FEDERATION/websso/saml2). The endpoint is protected by Apache mod_shib (shibboleth).

  3. The browser talks to the keystone identity service. Because the user's browser does not have an active session with ADFS, the keystone identity service issues an HTTP Redirect (301) to the browser, along with the required SAML2 request, to the ADFS endpoint.

  4. The browser talks to ADFS. ADFS returns a login form. The browser presents it to the user.

  5. The user enters credentials (such as username and password) and submits the form to ADFS.

  6. Upon successful validation of the user's credentials, ADFS issues an HTTP Redirect (301) to the browser, along with the SAML2 assertion, to the keystone identity service's (public) SAML2 endpoint (/auth/OS-FEDERATION/websso/saml2).

  7. The browser talks to the keystone identity service. The keystone identity service validates the SAML2 assertion and issues a federated unscoped token. It then returns JavaScript code to be executed by the browser, along with the federated unscoped token in the headers.

  8. Upon execution of the JavaScript code, the browser is redirected to the horizon dashboard service with the federated unscoped token in the header.

  9. The browser talks to the horizon dashboard service with the federated unscoped token.

  10. With the unscoped token, the horizon dashboard service talks to the keystone identity service's (internal) endpoint to get a list of projects the user has access to.

  11. The horizon dashboard service rescopes the token to the first project in the list. At this point, the user is successfully logged in.
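Step 2 of the sequence, where horizon sends the browser to the keystone WebSSO endpoint, can be illustrated by constructing the redirect URL. This is a sketch under stated assumptions: the host names are hypothetical, and the `origin` query parameter (the horizon URL that keystone sends the browser back to) follows upstream keystone WebSSO behavior rather than anything stated in this guide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def websso_redirect_url(keystone_public, protocol, horizon_origin):
    """Build the URL that the browser is redirected to in step 2.

    keystone_public is the public identity endpoint including /v3;
    protocol is the federation protocol ID, for example "saml2".
    """
    query = urlencode({"origin": horizon_origin})
    base = keystone_public.rstrip("/")
    return f"{base}/auth/OS-FEDERATION/websso/{protocol}?{query}"


# Hypothetical endpoints, for illustration only.
url = websso_redirect_url(
    "https://keystone.example.com:5000/v3",
    "saml2",
    "https://horizon.example.com/auth/websso/",
)
print(url)

# The origin value survives the round trip through URL encoding.
print(parse_qs(urlsplit(url).query)["origin"][0])
```

Because the endpoint path is protected by mod_shib, a browser that requests this URL without an active session is sent on to ADFS, which is step 3 in the list.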

The sequence of events for WebSSO using the OpenID method is similar to that of the SAML method.

5.11.4 Prerequisites

5.11.4.1 WebSSO Using SAML Method

5.11.4.1.1 Creating ADFS metadata

For information about creating Active Directory Federation Services metadata, see the section "To create edited ADFS 2.0 metadata with an added scope element" at https://technet.microsoft.com/en-us/library/gg317734.

  1. On the ADFS computer, use a browser such as Internet Explorer to view https://<adfs_server_hostname>/FederationMetadata/2007-06/FederationMetadata.xml.

  2. On the File menu, click Save as, and then navigate to the Windows desktop and save the file with the name adfs_metadata.xml. Make sure to change the Save as type drop-down box to All Files (*.*).

  3. Use Windows Explorer to navigate to the Windows desktop, right-click adfs_metadata.xml, and then click Edit.

  4. In Notepad, insert the following XML in the first element. Before editing, the EntityDescriptor appears as follows:

    <EntityDescriptor ID="abc123" entityID="http://WIN-CAICP35LF2I.vlan44.domain/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata">

    After editing, it should look like this:

    <EntityDescriptor ID="abc123" entityID="http://WIN-CAICP35LF2I.vlan44.domain/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:shibmd="urn:mace:shibboleth:metadata:1.0">
  5. In Notepad, on the Edit menu, click Find. In Find what, type IDPSSO, and then click Find Next.

  6. Insert the following XML in this section. Before editing, the IDPSSODescriptor appears as follows:

    <IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol"><KeyDescriptor use="encryption">

    After editing, it should look like this:

    <IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol"><Extensions><shibmd:Scope regexp="false">vlan44.domain</shibmd:Scope></Extensions><KeyDescriptor use="encryption">
  7. Delete the metadata document signature section of the file (the ds:Signature element in the following code). Because you have edited the document, the signature is now invalid. Before editing, the signature appears as follows:

    <EntityDescriptor ID="abc123" entityID="http://FSWEB.contoso.com/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:shibmd="urn:mace:shibboleth:metadata:1.0">
    <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
        SIGNATURE DATA
    </ds:Signature>
    <RoleDescriptor xsi:type=…>

    After editing it should look like this:

    <EntityDescriptor ID="abc123" entityID="http://FSWEB.contoso.com/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:shibmd="urn:mace:shibboleth:metadata:1.0">
    <RoleDescriptor xsi:type=…>
  8. Save and close adfs_metadata.xml.

  9. Copy adfs_metadata.xml to the Cloud Lifecycle Manager node, place it in the /var/lib/ardana/openstack/my_cloud/config/keystone/ directory, and put it under revision control.

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add my_cloud/config/keystone/adfs_metadata.xml
    ardana > git commit -m "Add ADFS metadata file for WebSSO authentication"
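The manual Notepad edits in steps 4 through 7 can also be scripted. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; it is an alternative illustration, not the procedure tested for this guide. The function name and the shortened sample document are assumptions. On output, ElementTree declares all namespaces, including `xmlns:shibmd`, on the root element.

```python
import xml.etree.ElementTree as ET

MD = "urn:oasis:names:tc:SAML:2.0:metadata"
DS = "http://www.w3.org/2000/09/xmldsig#"
SHIBMD = "urn:mace:shibboleth:metadata:1.0"


def add_scope_and_strip_signature(xml_text, scope):
    """Return metadata with a shibmd:Scope added and the signature removed."""
    # Keep readable prefixes when the document is serialized again.
    ET.register_namespace("", MD)
    ET.register_namespace("ds", DS)
    ET.register_namespace("shibmd", SHIBMD)

    root = ET.fromstring(xml_text)

    # The signature no longer matches the edited document, so drop it.
    for signature in root.findall(f"{{{DS}}}Signature"):
        root.remove(signature)

    idp = root.find(f"{{{MD}}}IDPSSODescriptor")
    if idp is None:
        raise ValueError("no IDPSSODescriptor element found")

    # Insert <Extensions><shibmd:Scope regexp="false">scope</shibmd:Scope>
    # as the first child, ahead of the KeyDescriptor elements.
    extensions = ET.Element(f"{{{MD}}}Extensions")
    scope_el = ET.SubElement(extensions, f"{{{SHIBMD}}}Scope", {"regexp": "false"})
    scope_el.text = scope
    idp.insert(0, extensions)

    return ET.tostring(root, encoding="unicode")


# Shortened, hypothetical stand-in for a real FederationMetadata.xml.
SAMPLE = """<EntityDescriptor ID="abc123"
    entityID="http://FSWEB.contoso.com/adfs/services/trust"
    xmlns="urn:oasis:names:tc:SAML:2.0:metadata">
  <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">SIGNATURE DATA</ds:Signature>
  <IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <KeyDescriptor use="encryption"/>
  </IDPSSODescriptor>
</EntityDescriptor>"""

print(add_scope_and_strip_signature(SAMPLE, "contoso.com"))
```

Review the result against the before and after snippets in the steps above before committing the file.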
5.11.4.1.2 Setting Up WebSSO

Start by creating a config file adfs_config.yml with the following parameters and place it in the /var/lib/ardana/openstack/my_cloud/config/keystone/ directory on your Cloud Lifecycle Manager node.

keystone_trusted_idp: adfs
keystone_sp_conf:
    idp_metadata_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_metadata.xml
    shib_sso_application_entity_id: http://sp_uri_entityId
    shib_sso_idp_entity_id: http://default_idp_uri_entityId
    target_domain:
        name: domain1
        description: my domain
    target_project:
        name: project1
        description: my project
    target_group:
        name: group1
        description: my group
    role:
        name: service
    identity_provider:
        id: adfs_idp1
        description: This is the ADFS identity provider.
    mapping:
        id: mapping1
        rules_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_mapping.json
    protocol:
        id: saml2
    attribute_map:
        -
          name: http://schemas.xmlsoap.org/claims/Group
          id: ADFS_GROUP
        -
          name: urn:oid:1.3.6.1.4.1.5923.1.1.1.6
          id: ADFS_LOGIN

A sample config file like this exists in roles/KEY-API/files/samples/websso/keystone_configure_adfs_sample.yml. The config options are described in detail below:

keystone_trusted_idp: A flag that indicates whether this configuration is used for WebSSO or K2K. The value can be either 'adfs' or 'k2k'.
keystone_sp_conf:
    shib_sso_idp_entity_id: The ADFS URI used as an entity ID to identify the IdP.
    shib_sso_application_entity_id: The service provider URI used as an entity ID. For WebSSO it can be any URI, as long as it is unique to the SP.
    target_domain: The domain in which the group will be created.
        name: Any domain name. If the domain does not exist, it will be created; otherwise it will be updated.
        description: Any description.
    target_project: The project that the group is scoped to.
        name: Any project name. If the project does not exist, it will be created; otherwise it will be updated.
        description: Any description.
    target_group: The group that will be created in 'target_domain'.
        name: Any group name. If the group does not exist, it will be created; otherwise it will be updated.
        description: Any description.
    role: The role that will be assigned on 'target_project'. This role determines the permissions of the IdP user's scoped token on the SP side.
        name: Must be an existing role.
    idp_metadata_file: A reference to the ADFS metadata file that is used to validate the SAML2 assertion.
    identity_provider: The ADFS IdP.
        id: Any ID. If it does not exist, it will be created; otherwise it will be updated. This ID needs to be shared with the client so that the right mapping will be selected.
        description: Any description.
    mapping: A mapping in JSON format that maps a federated user to a corresponding group.
        id: Any ID. If it does not exist, it will be created; otherwise it will be updated.
        rules_file: A reference to the file that contains the mapping in JSON format.
    protocol: The supported federation protocol.
        id: 'saml2' is the only supported protocol for WebSSO.
    attribute_map: A shibboleth mapping that defines additional attributes, mapping attributes from the SAML2 assertion to the names used in the WebSSO mapping that the SP understands.
        -
          name: An attribute name from the SAML2 assertion.
          id: An ID that the above name will be mapped to.
  1. Create a mapping file, adfs_mapping.json, in /var/lib/ardana/openstack/my_cloud/config/keystone/. The preceding config file references it as follows:

         rules_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_mapping.json

    The following is an example of the mapping file, existing in roles/KEY-API/files/samples/websso/adfs_sp_mapping.json:

    [
      {
        "local": [{
          "user": {
            "name": "{0}"
          }
        }],
        "remote": [{
          "type": "ADFS_LOGIN"
        }]
      },
      {
        "local": [{
          "group": {
            "id": "GROUP_ID"
          }
        }],
        "remote": [{
          "type": "ADFS_GROUP",
          "any_one_of": [
            "Domain Users"
          ]
        }]
      }
    ]

    You can find more details about how the WebSSO mapping works at http://docs.openstack.org. Also see Section 5.11.4.1.3, “Mapping rules” for more information.

  2. Add adfs_config.yml and adfs_mapping.json to revision control.

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > git add my_cloud/config/keystone/adfs_config.yml
    ardana > git add my_cloud/config/keystone/adfs_mapping.json
    ardana > git commit -m "Add ADFS config and mapping."
  3. Go to ~/scratch/ansible/next/ardana/ansible and run the following playbook to enable WebSSO in the keystone identity service:

    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@/var/lib/ardana/openstack/my_cloud/config/keystone/adfs_config.yml
  4. Enable WebSSO in the horizon dashboard service by setting the horizon_websso_enabled flag to True in roles/HZN-WEB/defaults/main.yml, and then run the horizon-reconfigure playbook:

    ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
5.11.4.1.3 Mapping rules

Each IdP-SP pair has only one mapping. The last mapping that you configure is the one used, and it overwrites the previous mapping. Therefore, if the example mapping adfs_sp_mapping.json is used, the following behavior is expected, because it maps the federated user only to the one group configured in keystone_configure_adfs_sample.yml:

  • Configure domain1/project1/group1 with mapping1. After a WebSSO login to horizon, the user sees project1.

  • Reconfigure with domain1/project2/group1 and mapping1. After a WebSSO login to horizon, the user sees project1 and project2.

  • Reconfigure with domain3/project3/group3 and mapping1. After a WebSSO login to horizon, the user sees only project3, because the IdP mapping now maps the federated user to group3, which has privileges only on project3.
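The three reconfiguration steps can be modeled with a small Python sketch. The model is an assumption made for illustration: each reconfiguration adds a role for the group on the target project (earlier assignments stay in place) and replaces the single mapping so that it points at the newly configured group. The class and method names are hypothetical.

```python
class ServiceProviderModel:
    """Toy model of one IdP-SP pair with a single, overwritable mapping."""

    def __init__(self):
        self.group_projects = {}  # group name -> projects it has a role on
        self.mapped_group = None  # group targeted by the current mapping

    def reconfigure(self, project, group):
        # Role assignments accumulate for the group across reconfigurations.
        projects = self.group_projects.setdefault(group, [])
        if project not in projects:
            projects.append(project)
        # The new mapping overwrites the previous one.
        self.mapped_group = group

    def visible_projects(self):
        # The federated user sees whatever the mapped group has access to.
        return list(self.group_projects.get(self.mapped_group, []))


sp = ServiceProviderModel()

sp.reconfigure("project1", "group1")
print(sp.visible_projects())  # ['project1']

sp.reconfigure("project2", "group1")
print(sp.visible_projects())  # ['project1', 'project2']

sp.reconfigure("project3", "group3")
print(sp.visible_projects())  # ['project3']
```

The last step shows why project1 and project2 disappear: group1 still holds its assignments, but the mapping no longer places the user in group1.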

If you need a more complex mapping, you can use a custom mapping file, which must be specified in the rules_file option of keystone_configure_adfs_sample.yml.

You can use different attributes of the ADFS user in order to map to different or multiple groups.

An example of a more complex mapping file is adfs_sp_mapping_multiple_groups.json, as follows.

adfs_sp_mapping_multiple_groups.json

[
  {
    "local": [
      {
        "user": {
          "name": "{0}"
        }
      },
      {
        "group": {
           "name": "group1",
           "domain":{
             "name": "domain1"
           }
        }
      }
    ],
    "remote":[{
      "type": "ADFS_LOGIN"
    },
    {
      "type": "ADFS_GROUP",
      "any_one_of":[
         "Domain Users"
      ]
     }
    ]
   },
  {
    "local": [
      {
        "user": {
          "name": "{0}"
        }
      },
      {
        "group": {
           "name": "group2",
           "domain":{
             "name": "domain2"
           }
        }
      }
    ],
    "remote":[{
      "type": "ADFS_LOGIN"
    },
    {
      "type": "ADFS_SCOPED_AFFILIATION",
      "any_one_of": [
          "member@contoso.com"
      ]
    }
    ]
   }
]

The adfs_sp_mapping_multiple_groups.json file must be used together with keystone_configure_mutiple_groups_sample.yml, which adds a new attribute for the shibboleth mapping. That file is as follows:

keystone_configure_mutiple_groups_sample.yml

#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
---

keystone_trusted_idp: adfs
keystone_sp_conf:
    identity_provider:
        id: adfs_idp1
        description: This is the ADFS identity provider.
    idp_metadata_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_metadata.xml

    shib_sso_application_entity_id: http://blabla
    shib_sso_idp_entity_id: http://WIN-CAICP35LF2I.vlan44.domain/adfs/services/trust

    target_domain:
        name: domain2
        description: my domain

    target_project:
        name: project6
        description: my project

    target_group:
        name: group2
        description: my group

    role:
        name: admin

    mapping:
        id: mapping1
        rules_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_sp_mapping_multiple_groups.json

    protocol:
        id: saml2

    attribute_map:
        -
          name: http://schemas.xmlsoap.org/claims/Group
          id: ADFS_GROUP
        -
          name: urn:oid:1.3.6.1.4.1.5923.1.1.1.6
          id: ADFS_LOGIN
        -
          name: urn:oid:1.3.6.1.4.1.5923.1.1.1.9
          id: ADFS_SCOPED_AFFILIATION
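The reason the two files must be used together is that every remote "type" in the mapping rules has to be defined as an "id" in attribute_map. The following Python sketch checks that relationship on in-memory data. The helper name is hypothetical, and the data is transcribed from the two files above (only the "remote" part of each rule is needed).

```python
def undefined_remote_types(rules, attribute_map):
    """Return remote types used by the rules but missing from attribute_map."""
    defined = {entry["id"] for entry in attribute_map}
    used = {cond["type"] for rule in rules for cond in rule["remote"]}
    return sorted(used - defined)


# Remote sections of adfs_sp_mapping_multiple_groups.json.
RULES = [
    {"remote": [{"type": "ADFS_LOGIN"},
                {"type": "ADFS_GROUP", "any_one_of": ["Domain Users"]}]},
    {"remote": [{"type": "ADFS_LOGIN"},
                {"type": "ADFS_SCOPED_AFFILIATION",
                 "any_one_of": ["member@contoso.com"]}]},
]

# attribute_map from the basic sample (two entries).
BASIC_MAP = [
    {"name": "http://schemas.xmlsoap.org/claims/Group", "id": "ADFS_GROUP"},
    {"name": "urn:oid:1.3.6.1.4.1.5923.1.1.1.6", "id": "ADFS_LOGIN"},
]

# attribute_map from the multiple-groups sample (three entries).
MULTI_MAP = BASIC_MAP + [
    {"name": "urn:oid:1.3.6.1.4.1.5923.1.1.1.9", "id": "ADFS_SCOPED_AFFILIATION"},
]

print(undefined_remote_types(RULES, BASIC_MAP))  # ['ADFS_SCOPED_AFFILIATION']
print(undefined_remote_types(RULES, MULTI_MAP))  # []
```

With the two-entry attribute_map, the second rule refers to an attribute that shibboleth never exposes, so it could not match.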

5.11.4.2 Setting up the ADFS server as the identity provider

For ADFS to be able to communicate with the keystone identity service, you need to add the keystone identity service as a trusted relying party for ADFS and also specify the user attributes that you want to send to the keystone identity service when users authenticate via WebSSO.

For more information, see the Microsoft ADFS wiki, section "Step 2: Configure ADFS 2.0 as the identity provider and shibboleth as the Relying Party".

Log in to the ADFS server.

Add a relying party using metadata

  1. From Server Manager Dashboard, click Tools on the upper right, then ADFS Management.

  2. Right-click ADFS, and then select Add Relying Party Trust.

  3. Click Start, and leave the preselected option Import data about the relying party published online or on a local network selected.

  4. In the Federation metadata address field, type <keystone_publicEndpoint>/Shibboleth.sso/Metadata (your keystone identity service Metadata endpoint), and then click Next. You can also import metadata from a file. Create a file containing the output of the following curl command:

    curl <keystone_publicEndpoint>/Shibboleth.sso/Metadata

    and then choose this file for importing the metadata for the relying party.

  5. On the Specify Display Name page, choose a suitable name to identify this trust relationship, and then click Next.

  6. On the Choose Issuance Authorization Rules page, leave the default Permit all users to access the relying party selected, and then click Next.

  7. Click Next, and then click Close.

Edit claim rules for relying party trust

  1. The Edit Claim Rules dialog box should already be open. If not, in the ADFS center pane, under Relying Party Trusts, right-click your newly created trust, and then click Edit Claim Rules.

  2. On the Issuance Transform Rules tab, click Add Rule.

  3. On the Select Rule Template page, select Send LDAP Attributes as Claims, and then click Next.

  4. On the Configure Rule page, in the Claim rule name box, type Get Data.

  5. In the Attribute Store list, select Active Directory.

  6. In the Mapping of LDAP attributes section, create the following mappings.

    LDAP Attribute                      Outgoing Claim Type
    Token-Groups – Unqualified Names    Group
    User-Principal-Name                 UPN
  7. Click Finish.

  8. On the Issuance Transform Rules tab, click Add Rule.

  9. On the Select Rule Template page, select Send Claims Using a Custom Rule, and then click Next.

  10. On the Configure Rule page, in the Claim rule name box, type Transform UPN to epPN.

  11. In the Custom Rule window, type or copy and paste the following:

    c:[Type == "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"]
    => issue(Type = "urn:oid:1.3.6.1.4.1.5923.1.1.1.6", Value = c.Value, Properties["http://schemas.xmlsoap.org/ws/2005/05/identity/claimproperties/attributename"] = "urn:oasis:names:tc:SAML:2.0:attrname-format:uri");
  12. Click Finish.

  13. On the Issuance Transform Rules tab, click Add Rule.

  14. On the Select Rule Template page, select Send Claims Using a Custom Rule, and then click Next.

  15. On the Configure Rule page, in the Claim rule name box, type Transform Group to epSA.

  16. In the Custom Rule window, type or copy and paste the following:

    c:[Type == "http://schemas.xmlsoap.org/claims/Group", Value == "Domain Users"]
    => issue(Type = "urn:oid:1.3.6.1.4.1.5923.1.1.1.9", Value = "member@contoso.com", Properties["http://schemas.xmlsoap.org/ws/2005/05/identity/claimproperties/attributename"] = "urn:oasis:names:tc:SAML:2.0:attrname-format:uri");
  17. Click Finish, and then click OK.

This list of claim rules is only an example. It can be modified or extended to suit your requirements and the specifics of your ADFS setup.
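The effect of the two custom rules can be illustrated with a short Python model. This is only a sketch of the transformation, not the ADFS claim rule engine: claims are modeled as (type, value) pairs, and the sample user name is hypothetical. The claim type URIs and the issued value are taken from the rules above.

```python
UPN = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"
GROUP = "http://schemas.xmlsoap.org/claims/Group"
EPPN = "urn:oid:1.3.6.1.4.1.5923.1.1.1.6"  # mapped to ADFS_LOGIN
EPSA = "urn:oid:1.3.6.1.4.1.5923.1.1.1.9"  # mapped to ADFS_SCOPED_AFFILIATION


def apply_custom_rules(claims):
    """Return the input claims plus the claims issued by the two custom rules."""
    issued = list(claims)
    for claim_type, value in claims:
        # Rule "Transform UPN to epPN": copy the UPN value to the epPN claim.
        if claim_type == UPN:
            issued.append((EPPN, value))
        # Rule "Transform Group to epSA": fixed value for Domain Users members.
        if claim_type == GROUP and value == "Domain Users":
            issued.append((EPSA, "member@contoso.com"))
    return issued


# Claims produced by the "Get Data" rule for a hypothetical user.
incoming = [
    (UPN, "user1@contoso.com"),
    (GROUP, "Domain Users"),
]

for claim in apply_custom_rules(incoming):
    print(claim)
```

The two issued claim types are the ones that the attribute_map entries translate into ADFS_LOGIN and ADFS_SCOPED_AFFILIATION for the keystone mapping.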

Create a sample user on the ADFS server

  1. From the Server Manager Dashboard, click Tools on the upper right, then Active Directory Users and Computers.

  2. Right-click Users, then select New, and then User.

  3. Follow the on-screen instructions.

You can test the horizon dashboard service "Login with ADFS" by opening a browser at the horizon dashboard service URL and choosing Authenticate using: ADFS Credentials. You should be redirected to the ADFS login page and be able to log in to the horizon dashboard service with your ADFS credentials.

5.11.5 WebSSO Using OpenID Method

The interaction between Keystone and the external Identity Provider (IdP) is handled by the Apache2 auth_openidc module.

There are two steps to enable the feature.

  1. Configure Keystone with the required OpenID Connect provider information.

  2. Create the Identity Provider, protocol, and mapping in Keystone, using the OpenStack command line tool.

5.11.5.1 Configuring Keystone

  1. Log in to the Cloud Lifecycle Manager node and edit the ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml file to set the "keystone_openid_connect_conf" variable. For example:

    keystone_openid_connect_conf:
        identity_provider: google
        response_type: id_token
        scope: "openid email profile"
        metadata_url: https://accounts.google.com/.well-known/openid-configuration
        client_id: [Replace with your client ID]
        client_secret: [Replace with your client secret]
        redirect_uri: https://www.myenterprise.com:5000/v3/OS-FEDERATION/identity_providers/google/protocols/openid/auth
        crypto_passphrase: ""

    Where:

    • identity_provider: name of the OpenID Connect identity provider. This must be the same as the name of the identity provider to be created in Keystone using the OpenStack Command Line Tool. For example, if the identity provider is foo, the identity provider must be created in Keystone with that same name:

      openstack identity provider create foo
    • response_type: corresponding to auth_openidc OIDCResponseType. In most cases, it should be "id_token".

    • scope: corresponding to auth_openidc OIDCScope.

    • metadata_url: corresponding to auth_openidc OIDCProviderMetadataURL.

    • client_id: corresponding to auth_openidc OIDCClientID.

    • client_secret: corresponding to auth_openidc OIDCClientSecret.

    • redirect_uri: corresponding to auth_openidc OIDCRedirectURI. This must be the Keystone public endpoint for the given OpenID Connect identity provider, for example "https://keystone-public-endpoint.foo.com/v3/OS-FEDERATION/identity_providers/foo/protocols/openid/auth".

      Warning

      Some OpenID Connect IdPs such as Google require the hostname in the "redirect_uri" to be a public FQDN. In that case, the hostname in Keystone public endpoint must also be a public FQDN and must match the one specified in the "redirect_uri".

    • crypto_passphrase: corresponding to auth_openidc OIDCCryptoPassphrase. If left blank, a random crypto passphrase will be generated.

  2. Commit the changes to your local git repository.

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "add OpenID Connect configuration"
  3. Run the keystone-reconfigure Ansible playbook.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
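
Before committing, it can help to sanity-check the values described above. The following is a minimal sketch, not part of the product: the function names are hypothetical, and the checks only encode what this section states (the redirect_uri path must point at the configured identity provider with the openid protocol) plus the OpenID Connect requirement that the scope contains openid.

```python
from urllib.parse import urlparse

# Keys described in this section; crypto_passphrase may be left blank.
REQUIRED_KEYS = [
    "identity_provider", "response_type", "scope", "metadata_url",
    "client_id", "client_secret", "redirect_uri",
]


def expected_redirect_path(identity_provider):
    """Path of the Keystone federation endpoint for the openid protocol."""
    return ("/v3/OS-FEDERATION/identity_providers/"
            + identity_provider + "/protocols/openid/auth")


def check_openid_conf(conf):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not conf.get(key):
            problems.append("missing value for " + key)
    idp = conf.get("identity_provider")
    uri = conf.get("redirect_uri")
    if idp and uri and urlparse(uri).path != expected_redirect_path(idp):
        problems.append("redirect_uri path does not match identity_provider")
    if conf.get("scope") and "openid" not in conf["scope"].split():
        problems.append("scope does not include openid")
    return problems


# Values copied from the example above; the client ID and secret are placeholders.
example_conf = {
    "identity_provider": "google",
    "response_type": "id_token",
    "scope": "openid email profile",
    "metadata_url": "https://accounts.google.com/.well-known/openid-configuration",
    "client_id": "[Replace with your client ID]",
    "client_secret": "[Replace with your client secret]",
    "redirect_uri": "https://www.myenterprise.com:5000/v3/OS-FEDERATION/"
                    "identity_providers/google/protocols/openid/auth",
    "crypto_passphrase": "",
}

print(check_openid_conf(example_conf))
```

A mismatch between identity_provider and the provider name embedded in redirect_uri is reported as a problem, which mirrors the naming requirement described above.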

5.11.5.2 Configure Horizon

Complete the following steps to configure horizon to support WebSSO with the OpenID method.

  1. Edit the ~/openstack/ardana/ansible/roles/HZN-WEB/defaults/main.yml file and set the following parameter to True.

    horizon_websso_enabled: True
  2. Locate the last line in the ~/openstack/ardana/ansible/roles/HZN-WEB/defaults/main.yml file. The default configuration for this line should look like the following:

    horizon_websso_choices:
      - {protocol: saml2, description: "ADFS Credentials"}
    • If your cloud does not have AD FS enabled, then replace the preceding horizon_websso_choices: parameter with the following.

      - {protocol: openid, description: "OpenID Connect"}

      The resulting block should look like the following.

      horizon_websso_choices:
          - {protocol: openid, description: "OpenID Connect"}
    • If your cloud does have ADFS enabled, do not replace the default entry. Instead, add the following line to the existing horizon_websso_choices: block.

      - {protocol: openid, description: "OpenID Connect"}

      If your cloud has ADFS enabled, the final block of your ~/openstack/ardana/ansible/roles/HZN-WEB/defaults/main.yml should have the following entries.

      horizon_websso_choices:
          - {protocol: openid, description: "OpenID Connect"}
          - {protocol: saml2, description: "ADFS Credentials"}
  3. Run the following commands to add your changes to the local git repository and reconfigure the horizon service, applying the changes made in the previous steps:

    cd ~/openstack
    git add -A
    git commit -m "Configured WebSSO using OpenID Connect"
    cd ~/openstack/ardana/ansible/
    ansible-playbook -i hosts/localhost config-processor-run.yml
    ansible-playbook -i hosts/localhost ready-deployment.yml
    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml

5.11.5.3 Create Identity Provider, Protocol, and Mapping

To fully enable OpenID Connect, the Identity Provider, Protocol, and Mapping for the given IdP must be created in Keystone. This is done with the OpenStack Command Line Tool, using the Keystone admin credentials.

  1. Log in to the Cloud Lifecycle Manager node and source the keystone.osrc file.

    source ~/keystone.osrc
  2. Create the Identity Provider. For example:

    openstack identity provider create google
    Warning

    The name of the Identity Provider must be exactly the same as the "identity_provider" attribute given when configuring Keystone in the previous section.

  3. Next, create the Mapping for the Identity Provider. Before creating the Mapping, make sure you understand how Mapping Combinations work, because an incorrect mapping can have serious security implications. The following is an example of a mapping file.

    [
        {
            "local": [
                {
                    "user": {
                        "name": "{0}",
                        "email": "{1}",
                        "type": "ephemeral"
                    },
                    "group": {
                        "domain": {
                            "name": "Default"
                        },
                        "name": "openidc_demo"
                    }
                }
            ],
            "remote": [
                {
                    "type": "REMOTE_USER"
                },
                {
                    "type": "HTTP_OIDC_EMAIL"
                }
            ]
        }
    ]

    Once the mapping file is created, create the Mapping resource in Keystone. For example:

    openstack mapping create --rules oidc_mapping.json oidc_mapping
  4. Lastly, create the Protocol for the Identity Provider and its mapping. For OpenID Connect, the protocol name must be openid. For example:

    openstack federation protocol create --identity-provider google --mapping oidc_mapping openid
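
Because a faulty mapping can have security implications, a quick structural check of the mapping file before uploading it may be useful. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that each {N} placeholder in the "local" part refers to the N-th entry (counting from zero) of the "remote" list, as in the example mapping above.

```python
import json
import re

# The example mapping from this section.
MAPPING_JSON = """
[
    {
        "local": [
            {
                "user": {"name": "{0}", "email": "{1}", "type": "ephemeral"},
                "group": {"domain": {"name": "Default"}, "name": "openidc_demo"}
            }
        ],
        "remote": [
            {"type": "REMOTE_USER"},
            {"type": "HTTP_OIDC_EMAIL"}
        ]
    }
]
"""


def placeholder_indexes(obj):
    """Collect every {N} placeholder index in the strings of a nested structure."""
    found = set()
    if isinstance(obj, str):
        found.update(int(n) for n in re.findall(r"\{(\d+)\}", obj))
    elif isinstance(obj, dict):
        for value in obj.values():
            found |= placeholder_indexes(value)
    elif isinstance(obj, list):
        for value in obj:
            found |= placeholder_indexes(value)
    return found


def unresolved_placeholders(rule):
    """Return placeholder indexes that have no matching "remote" entry."""
    remote_count = len(rule["remote"])
    return sorted(i for i in placeholder_indexes(rule["local"]) if i >= remote_count)


rules = json.loads(MAPPING_JSON)
for number, rule in enumerate(rules):
    print(number, unresolved_placeholders(rule))
```

An empty list for a rule means every placeholder it uses has a corresponding remote attribute. This check does not judge whether the mapping grants appropriate access.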

5.12 Identity Service Notes and Limitations

5.12.1 Notes

This topic describes limitations of and important notes pertaining to the identity service.

Domains

  • Domains can be created and managed by the horizon web interface, keystone API and OpenStackClient CLI.

  • The configuration of external authentication systems requires the creation and usage of Domains.

  • All configurations are managed by creating and editing specific configuration files.

  • End users can authenticate to a particular project and domain via the horizon web interface, keystone API and OpenStackClient CLI.

  • A new horizon login page that requires a Domain entry is now installed by default.

keystone-to-keystone Federation

  • keystone-to-keystone (K2K) Federation provides the ability to authenticate once with one cloud and then use these credentials to access resources on other federated clouds.

  • All configurations are managed by creating and editing specific configuration files.

Multi-Factor Authentication (MFA)

  • The keystone architecture provides support for MFA deployments.

  • MFA provides the ability to deploy non-password based authentication, for example token-providing hardware and text messages.

Hierarchical Multitenancy

  • Provides the ability to create sub-projects within a Domain-Project hierarchy.

Hash Algorithm Configuration

  • The default hash algorithm is bcrypt, which has a built-in limitation of 72 characters. As keystone defaults to a secret length of 86 characters, customers may choose to change the keystone hash algorithm to one that supports the full length of their secret.

  • Process for changing the hash algorithm configuration:

    1. Update the identity section of keystone.conf.j2 to reference the desired algorithm:

      [identity]
      password_hash_algorithm=pbkdf2_sha512
    2. Commit the changes.

    3. Run the keystone-redeploy.yml playbook:

      ansible-playbook -i hosts/verb_hosts keystone-redeploy.yml
    4. Verify that existing users retain access by logging in to Horizon.
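
The effect of the 72-character limit can be illustrated with a small simulation. This sketch does not call bcrypt itself: truncating_hash is a hypothetical stand-in that ignores everything past 72 bytes, which is the limitation described above, while full_length_hash uses PBKDF2 with SHA-512 from the Python standard library. The salt and iteration count are arbitrary demonstration values.

```python
import hashlib

BCRYPT_MAX_BYTES = 72  # limit stated in this section


def truncating_hash(secret, salt=b"demo-salt"):
    """Stand-in for a hash that ignores input beyond 72 bytes (not real bcrypt)."""
    data = secret.encode("utf-8")[:BCRYPT_MAX_BYTES]
    return hashlib.sha256(salt + data).hexdigest()


def full_length_hash(secret, salt=b"demo-salt"):
    """PBKDF2-HMAC-SHA512 over the whole secret (demo iteration count)."""
    return hashlib.pbkdf2_hmac("sha512", secret.encode("utf-8"), salt, 1000).hex()


# Two 86-character secrets that differ only after character 72.
secret_a = "x" * 72 + "A" * 14
secret_b = "x" * 72 + "B" * 14

print(truncating_hash(secret_a) == truncating_hash(secret_b))    # secrets collide
print(full_length_hash(secret_a) == full_length_hash(secret_b))  # secrets differ
```

With the truncating stand-in the last 14 characters of an 86-character secret contribute nothing, which is the motivation for choosing an algorithm that covers the full length.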

5.12.2 Limitations

Authentication with external authentication systems (LDAP, Active Directory (AD) or Identity Providers)

  • No horizon web portal support currently exists for the creation and management of external authentication system configurations.

Integration with LDAP services: SUSE OpenStack Cloud 9 domain-specific configuration

  • No Global User Listing: Once domain-specific driver configuration is enabled, listing all users and listing all groups are not supported operations. Those calls require a specific domain filter and a domain-scoped token for the target domain.

  • You cannot have both a file store and a database store for domain-specific driver configuration in a single identity service instance. Once a database store is enabled within the identity service instance, any file store will be ignored, and vice versa.

  • The identity service allows a list limit configuration to globally set the maximum number of entities that will be returned in an identity collection per request but it does not support per-domain list limit setting at this time.

  • Each time a new domain is configured with LDAP integration the single CA file gets overwritten. Ensure that you place certs for all the LDAP back-end domains in the cacert parameter. Detailed CA file inclusion instructions are provided in the comments of the sample YAML configuration file keystone_configure_ldap_my.yml (see Section 5.9.2, “Set up domain-specific driver configuration - file store”).

  • LDAP is only supported for identity operations (reading users and groups from LDAP).

  • keystone assignment operations from LDAP records, such as managing or assigning roles and projects, are not currently supported.

  • The SUSE OpenStack Cloud 'default' domain is pre-configured to store service account users and is authenticated locally against the identity service. Domains configured for external LDAP integration are non-default domains.

  • When using the current OpenStackClient CLI you must use the user ID rather than the user name when working with a non-default domain.

  • Each LDAP connection with the identity service is for read-only operations. Configurations that require identity service write operations (to create users, groups, etc.) are not currently supported.

SUSE OpenStack Cloud 9 API-based domain-specific configuration management

  • No GUI dashboard exists for domain-specific driver configuration management.

  • API-based domain-specific configuration does not check the type of an option.

  • API-based domain-specific configuration does not check whether an option value is supported.

  • The API-based domain configuration method does not provide retrieval of the default values of domain-specific configuration options.

  • Status: Domain-specific driver configuration database store is a non-core feature for SUSE OpenStack Cloud 9.

5.12.3 keystone-to-keystone federation

  • When a user is disabled in the identity provider, the federated token already issued by the service provider remains valid until it expires, based on the keystone expiration setting.

  • An already issued federated token will retain its scope until its expiration. Any changes in the mapping on the service provider will not impact the scope of an already issued federated token. For example, if an already issued federated token was mapped to group1 that has scope on project1, and the mapping is changed to group2 that has scope on project2, the previously issued federated token still has scope on project1.

  • Access to service provider resources is provided only through the python-keystone CLI client or the keystone API. No horizon web interface support is currently available.

  • Domains, projects, groups, roles, and quotas are created per the service provider cloud. Support for federated projects, groups, roles, and quotas is currently not available.

  • keystone-to-keystone federation and WebSSO cannot be configured by putting both sets of configuration attributes in the same config file; they will overwrite each other. Consequently, they need to be configured individually.

  • Scoping the federated user to a domain is not supported by default in the playbook. To enable it, see the steps in Section 5.10.7, “Scope Federated User to Domain”.

  • No horizon web portal support currently exists for the creation and management of federation configurations.

  • All end user authentication is available only via the keystone API and OpenStackClient CLI.

  • Additional information can be found at http://docs.openstack.org.

WebSSO

  • The WebSSO function supports only horizon web authentication. It is not supported for direct API or CLI access.

  • WebSSO works only with Fernet token provider. See Section 5.8.4, “Fernet Tokens”.

  • The SUSE OpenStack Cloud WebSSO function with SAML method was tested with Microsoft Active Directory Federation Services (ADFS). The instructions provided are pertinent to ADFS and are intended to provide a sample configuration for deploying WebSSO with an external identity provider. If you have a different identity provider such as Ping Identity or IBM Tivoli, consult with those vendors for specific instructions for those products.

  • The SUSE OpenStack Cloud WebSSO function with OpenID method was tested with Google OAuth 2.0 APIs, which conform to the OpenID Connect specification. The interaction between keystone and the external Identity Provider (IdP) is handled by the Apache2 auth_openidc module. Please consult with the specific OpenID Connect vendor on whether they support auth_openidc.

  • Both SAML and OpenID methods are supported for WebSSO federation in SUSE OpenStack Cloud 9.

  • WebSSO has a change password option in User Settings, but note that this function is not accessible for users authenticating with external systems such as LDAP or SAML Identity Providers.

Multi-factor authentication (MFA)

Hierarchical multitenancy

Missing quota information for compute resources

Note

If you are running a swift-only deployment (no Compute service), the default horizon page does not show any quota information for Compute resources and displays the following error message:

The Compute service is not installed or is not configured properly. No information is available for Compute resources.

This error message is expected because no Compute service is configured for this deployment. You can ignore it.

The following performance benchmark is based on 150 concurrent requests, run for 10-minute periods of stable load.

Operation          In SUSE OpenStack Cloud 9 (secs/request)   In SUSE OpenStack Cloud 9 3.0 (secs/request)
Token Creation     0.86                                       0.42
Token Validation   0.47                                       0.41

Because token creation operations happen less frequently than token validation operations, the longer token creation time is unlikely to cause a noticeable performance problem.

5.12.4 System cron jobs need setup

keystone relies on two cron jobs to periodically clean up expired tokens and for token revocation. The following is how the cron jobs appear on the system:

1 1 * * * /opt/stack/service/keystone/venv/bin/keystone-manage token_flush
1 1,5,10,15,20 * * * /opt/stack/service/keystone/venv/bin/revocation_cleanup.sh

By default, the two cron jobs are enabled on controller node 1 only, not on the other two nodes. When controller node 1 is down or has failed for any reason, these two cron jobs must be manually set up on one of the other two nodes.
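
To see when these jobs fire, the minute and hour fields of the two entries above can be expanded into times of day. This is a minimal sketch with hypothetical helper names; it only understands the field forms used in these two entries (a number, a comma-separated list, or *) and ignores the day and month fields, which are all * here.

```python
def expand_field(field, low, high):
    """Expand a cron field that is '*', a number, or a comma-separated list."""
    if field == "*":
        return list(range(low, high + 1))
    return sorted(int(part) for part in field.split(","))


def daily_run_times(cron_line):
    """Return the HH:MM times at which a daily cron entry runs."""
    minute, hour = cron_line.split()[:2]
    return ["%02d:%02d" % (h, m)
            for h in expand_field(hour, 0, 23)
            for m in expand_field(minute, 0, 59)]


token_flush = "1 1 * * * /opt/stack/service/keystone/venv/bin/keystone-manage token_flush"
revocation = "1 1,5,10,15,20 * * * /opt/stack/service/keystone/venv/bin/revocation_cleanup.sh"

print(daily_run_times(token_flush))
print(daily_run_times(revocation))
```

The token flush runs once a day and the revocation cleanup five times a day, so a replacement node needs both entries to keep the same cadence.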

6 Managing Compute

Information about managing and configuring the Compute service.

6.1 Managing Compute Hosts using Aggregates and Scheduler Filters

OpenStack nova has the concepts of availability zones and host aggregates that enable you to segregate your compute hosts. Availability zones are used to specify logical separation within your cloud based on the physical isolation or redundancy you have set up. Host aggregates are used to group compute hosts together based upon common features, such as operating system. For more information, see Scaling and Segregating your Cloud.

The nova scheduler also has a filter scheduler, which supports both filtering and weighting to make decisions on where new compute instances should be created. For more information, see Filter Scheduler and Scheduling.

This section shows you how to set up a nova host aggregate and how to configure the filter scheduler to further segregate your compute hosts.

6.1.1 Creating a nova Aggregate

These steps show you how to create a nova aggregate and how to add a compute host to it. You can run these steps on any machine that has the OpenStackClient installed and has network access to your cloud environment. These requirements are met by the Cloud Lifecycle Manager.

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrative credentials:

    ardana > source ~/service.osrc
  3. List your current nova aggregates:

    ardana > openstack aggregate list
  4. Create a new nova aggregate with this syntax:

    ardana > openstack aggregate create AGGREGATE-NAME

    If you wish to have the aggregate appear as an availability zone, then specify an availability zone with this syntax:

    ardana > openstack aggregate create --zone AVAILABILITY-ZONE-NAME AGGREGATE-NAME

    So, for example, if you wish to create a new aggregate for your SUSE Linux Enterprise compute hosts and you wanted that to show up as the SLE availability zone, you could use this command:

    ardana > openstack aggregate create --zone SLE SLE

    This would produce an output similar to this:

    +----+------+-------------------+-------+--------------------------+
    | Id | Name | Availability Zone | Hosts | Metadata                 |
    +----+------+-------------------+-------+--------------------------+
    | 12 | SLE  | SLE               |       | 'availability_zone=SLE'  |
    +----+------+-------------------+-------+--------------------------+
  5. Next, add compute hosts to this aggregate. Start by listing the hosts that are currently running the compute service:

    ardana > openstack hypervisor list
  6. You can then add host(s) to your aggregate with this syntax:

    ardana > openstack aggregate add host AGGREGATE-NAME HOST
  7. Then you can confirm that this has been completed by listing the details of your aggregate:

    ardana > openstack aggregate show AGGREGATE-NAME

    You can also list out your availability zones using this command:

    ardana > openstack availability zone list

6.1.2 Using nova Scheduler Filters

The nova scheduler has two filters, described here, that help differentiate between compute hosts.

Filter: AggregateImagePropertiesIsolation

Description: Isolates compute hosts based on image properties and aggregate metadata. You can use commas to specify multiple values for the same property. The filter will then ensure at least one value matches.

Filter: AggregateInstanceExtraSpecsFilter

Description: Checks that the aggregate metadata satisfies any extra specifications associated with the instance type. This uses aggregate_instance_extra_specs.

Note

For details about other available filters, see Filter Scheduler.

Using the AggregateImagePropertiesIsolation Filter

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/nova/nova.conf.j2 file and add AggregateImagePropertiesIsolation to the scheduler_default_filters list, as shown at the end of the example below:

    # Scheduler
    ...
    scheduler_available_filters = nova.scheduler.filters.all_filters
    scheduler_default_filters = AvailabilityZoneFilter,RetryFilter,ComputeFilter,
     DiskFilter,RamFilter,ImagePropertiesFilter,ServerGroupAffinityFilter,
     ServerGroupAntiAffinityFilter,ComputeCapabilitiesFilter,NUMATopologyFilter,
     AggregateImagePropertiesIsolation
    ...

    Optionally, you can also add these lines (the separator defaults to .):

    aggregate_image_properties_isolation_namespace = <a prefix string>
    aggregate_image_properties_isolation_separator = <a separator character>

    If these are added, the filter will only match image properties whose names start with the namespace and separator. For example, setting them to my_name_space and : would mean the image property my_name_space:image_type=SLE matches metadata image_type=SLE, but an_other=SLE would not be inspected for a match at all.

    If these are not added, all image properties will be matched against any similarly named aggregate metadata.

  3. Add image properties to images that should be scheduled using the above filter.

  4. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "editing nova schedule filters"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Run the ready deployment playbook:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the nova reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
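
The namespace and separator behavior described in this procedure can be modeled in a few lines. This is a sketch of the matching rules as this section describes them, not the nova implementation: the function name is hypothetical, aggregate metadata values are assumed to be comma-separated strings, and properties without similarly named metadata are assumed to place no restriction on the host.

```python
def image_isolation_passes(image_props, aggregate_metadata,
                           namespace=None, separator="."):
    """Return True if the aggregate is acceptable for the image properties."""
    for key, wanted in image_props.items():
        if namespace is not None:
            prefix = namespace + separator
            if not key.startswith(prefix):
                continue  # outside the namespace: not inspected at all
            key = key[len(prefix):]
        if key not in aggregate_metadata:
            continue  # no similarly named metadata: nothing to compare
        # Commas allow several accepted values; one match is enough.
        allowed = {value.strip() for value in aggregate_metadata[key].split(",")}
        if wanted not in allowed:
            return False
    return True


sle_aggregate = {"image_type": "SLE"}

# Namespaced property is compared with the aggregate metadata.
print(image_isolation_passes({"my_name_space:image_type": "SLE"},
                             sle_aggregate, "my_name_space", ":"))
# A property outside the namespace is ignored, whatever its value.
print(image_isolation_passes({"an_other": "SLE"},
                             {"an_other": "something-else"}, "my_name_space", ":"))
```

The values "something-else" and the aggregate contents are made up for the demonstration; only my_name_space, image_type, and an_other come from the text above.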

Using the AggregateInstanceExtraSpecsFilter Filter

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/nova/nova.conf.j2 file and add AggregateInstanceExtraSpecsFilter to the scheduler_default_filters list, as shown at the end of the example below:

    # Scheduler
    ...
    scheduler_available_filters = nova.scheduler.filters.all_filters
    scheduler_default_filters = AvailabilityZoneFilter,RetryFilter,ComputeFilter,
     DiskFilter,RamFilter,ImagePropertiesFilter,ServerGroupAffinityFilter,
     ServerGroupAntiAffinityFilter,ComputeCapabilitiesFilter,NUMATopologyFilter,
     AggregateInstanceExtraSpecsFilter
    ...
  3. There is no additional configuration needed because the following is true:

    1. The filter assumes : is a separator.

    2. The filter will match all simple keys in extra_specs, plus all keys with a separator if the prefix is aggregate_instance_extra_specs. For example, image_type=SLE and aggregate_instance_extra_specs:image_type=SLE will both be matched against aggregate metadata image_type=SLE.

  4. Add extra_specs to flavors that should be scheduled according to the above.

  5. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "Editing nova scheduler filters"
  6. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Run the ready deployment playbook:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Run the nova reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
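
The key selection rule from step 3 can be sketched as follows. This is an illustration of the rule as stated in this section, with hypothetical function names, and it compares values by plain equality only, which is a simplifying assumption.

```python
SCOPE = "aggregate_instance_extra_specs"


def keys_to_match(extra_specs):
    """Select the flavor extra_specs that are compared with aggregate metadata."""
    selected = {}
    for key, value in extra_specs.items():
        if ":" not in key:
            selected[key] = value            # simple key: always considered
        else:
            scope, _, rest = key.partition(":")
            if scope == SCOPE:
                selected[rest] = value       # scoped to this filter
            # keys with any other prefix are left to other filters
    return selected


def aggregate_satisfies(extra_specs, aggregate_metadata):
    """True if every selected key has the same value in the aggregate metadata."""
    return all(aggregate_metadata.get(key) == value
               for key, value in keys_to_match(extra_specs).items())


flavor_specs = {
    "aggregate_instance_extra_specs:image_type": "SLE",
    "hw:cpu_model": "example-model",  # hypothetical value, ignored by this rule
}
print(keys_to_match(flavor_specs))
print(aggregate_satisfies(flavor_specs, {"image_type": "SLE"}))
```

Using the aggregate_instance_extra_specs: prefix on flavor keys keeps them from being interpreted by other filters that also read simple keys.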

6.2 Using Flavor Metadata to Specify CPU Model

Libvirt is a collection of software used in OpenStack to manage virtualization. It has the ability to emulate a host CPU model in a guest VM. In SUSE OpenStack Cloud nova, the ComputeCapabilitiesFilter limits this ability by checking the exact CPU model of the compute host against the requested compute instance model. It will only pick compute hosts that have the cpu_model requested by the instance model, and if the selected compute host does not have that cpu_model, the ComputeCapabilitiesFilter moves on to find another compute host that matches, if possible. Selecting an unavailable vCPU model may cause nova to fail with no valid host found.

To assist, there is a nova scheduler filter that captures cpu_models as a subset of a particular CPU family. The filter determines if the host CPU model is capable of emulating the guest CPU model by maintaining the mapping of the vCPU models and comparing it with the host CPU model.

There is a limitation when a particular cpu_model is specified with hw:cpu_model via a compute flavor: the cpu_mode will be set to custom. This mode ensures that a persistent guest virtual machine will see the same hardware no matter what host physical machine the guest virtual machine is booted on. This allows easier live migration of virtual machines. Because of this limitation, only some of the features of a CPU are exposed to the guest. Requesting particular CPU features is not supported.

6.2.1 Editing the flavor metadata in the horizon dashboard

These steps can be used to edit a flavor's metadata in the horizon dashboard to add the extra_specs for a cpu_model:

  1. Access the horizon dashboard and log in with admin credentials.

  2. Access the Flavors menu by (A) clicking on the menu button, (B) navigating to the Admin section, and then (C) clicking on Flavors.
  3. In the list of flavors, choose the flavor you wish to edit and click on the entry under the Metadata column.

    Note

    You can also create a new flavor and then choose that one to edit.

  4. In the Custom field, enter hw:cpu_model and then click on the + (plus) sign to continue.
  5. Enter the CPU model that you wish to use into the field, and then click Save.

6.3 Forcing CPU and RAM Overcommit Settings

SUSE OpenStack Cloud supports overcommitting of CPU and RAM resources on compute nodes. Overcommitting is a technique of allocating more virtualized CPUs and/or memory than there are physical resources.

The default settings for this are:

Setting: cpu_allocation_ratio
Default value: 16

Description: Virtual CPU to physical CPU allocation ratio, which affects all CPU filters. This configuration specifies a global ratio for CoreFilter. For AggregateCoreFilter, it will fall back to this configuration value if no per-aggregate setting is found.

Note

This can be set per-compute, or if set to 0.0, the value set on the scheduler node(s) will be used and defaulted to 16.0.

Setting: ram_allocation_ratio
Default value: 1.0

Description: Virtual RAM to physical RAM allocation ratio, which affects all RAM filters. This configuration specifies a global ratio for RamFilter. For AggregateRamFilter, it will fall back to this configuration value if no per-aggregate setting is found.

Note

This can be set per-compute, or if set to 0.0, the value set on the scheduler node(s) will be used and defaulted to 1.5.

Setting: disk_allocation_ratio
Default value: 1.0

Description: This is the virtual disk to physical disk allocation ratio used by the disk_filter.py script to determine if a host has sufficient disk space to fit a requested instance. A ratio greater than 1.0 will result in over-subscription of the available physical disk, which can be useful for more efficiently packing instances created with images that do not use the entire virtual disk, such as sparse or compressed images. It can be set to a value between 0.0 and 1.0 in order to preserve a percentage of the disk for uses other than instances.

Note

This can be set per-compute, or if set to 0.0, the value set on the scheduler node(s) will be used and defaulted to 1.0.
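
As a rough illustration of what these ratios mean, the capacity the scheduler can hand out is the physical capacity multiplied by the ratio. The sketch below uses a hypothetical helper and made-up host sizes; the fallback for a ratio of 0.0 follows the notes above.

```python
def schedulable_capacity(physical, ratio, scheduler_value):
    """Capacity available for allocation on one host.

    A per-compute ratio of 0.0 means "use the value set on the scheduler node".
    """
    effective = ratio if ratio != 0.0 else scheduler_value
    return physical * effective


# Hypothetical host: 16 physical cores, 64 GB RAM, 1000 GB disk.
print(schedulable_capacity(16, 16.0, 16.0))    # vCPUs with cpu_allocation_ratio = 16
print(schedulable_capacity(64, 1.0, 1.0))      # GB RAM with ram_allocation_ratio = 1.0
print(schedulable_capacity(1000, 0.5, 1.0))    # GB disk, half reserved for other uses
print(schedulable_capacity(16, 0.0, 16.0))     # 0.0 falls back to the scheduler value
```

A ratio above 1.0 overcommits the resource, and a disk ratio below 1.0 holds part of the disk back from instances.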

6.3.1 Changing the overcommit ratios for your entire environment

If you wish to change the CPU and/or RAM overcommit ratio settings for your entire environment then you can do so via your Cloud Lifecycle Manager with these steps.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the nova configuration settings located in this file:

    ~/openstack/my_cloud/config/nova/nova.conf.j2
  3. Add or edit the following lines to specify the ratios you wish to use:

    cpu_allocation_ratio = 16
    ram_allocation_ratio = 1.0
  4. Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "setting nova overcommit settings"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the nova reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml

6.4 Enabling the Nova Resize and Migrate Features

The nova resize and migrate features are disabled by default because they require passwordless SSH access between Compute hosts, with the user having access to the file systems to perform the copy. If you wish to use these features, the following steps show you how to enable them in your cloud.

6.4.1 Enabling Nova Resize and Migrate

If you wish to enable these features, use these steps on your Cloud Lifecycle Manager. This will deploy a set of public and private SSH keys to the Compute hosts, allowing the nova user SSH access between each of your Compute hosts.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the nova reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml --extra-vars nova_migrate_enabled=true
  3. To ensure that the resize and migration options show up in the horizon dashboard, run the horizon reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml

6.4.2 Disabling Nova Resize and Migrate

These features are disabled by default. However, if you have previously enabled them and wish to disable them again, you can use these steps on your Cloud Lifecycle Manager. This will remove the set of public and private SSH keys that were previously added to the Compute hosts, removing the nova user's SSH access between each of your Compute hosts.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the nova reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml --extra-vars nova_migrate_enabled=false
  3. To ensure that the resize and migrate options are removed from the horizon dashboard, run the horizon reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml

6.5 Enabling ESX Compute Instance(s) Resize Feature

Resizing ESX compute instances is disabled by default. If you want to use this option, these steps show you how to configure and enable it in your cloud.

The following feature is disabled by default:

  • Resize - this feature allows you to change the size of a Compute instance by changing its flavor. See the OpenStack User Guide for more details on its use.

6.5.1 Procedure

To configure and resize ESX compute instance(s), perform the following steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/nova/nova.conf.j2 file to add the following parameter under Policy:

    # Policy
    allow_resize_to_same_host=True
  3. Commit your configuration:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "<commit message>"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

    By default the nova resize feature is disabled. To enable nova resize, refer to Section 6.4, “Enabling the Nova Resize and Migrate Features”.

    By default an ESX console log is not set up. For more details about Hypervisor setup, refer to the OpenStack documentation.
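
Once the change has been deployed, one way to confirm the setting is to parse the text of a rendered nova.conf. The sketch below assumes the option ends up in the [DEFAULT] section and that you pass in the rendered file content, not the .j2 template. The function name is illustrative only.

```python
import configparser


def resize_to_same_host_enabled(nova_conf_text):
    """Return True if allow_resize_to_same_host is true in [DEFAULT].

    Assumption: the option is rendered into the [DEFAULT] section.
    """
    # interpolation=None keeps literal % characters from raising errors;
    # strict=False tolerates repeated sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    return parser.getboolean(
        "DEFAULT", "allow_resize_to_same_host", fallback=False
    )


sample = """[DEFAULT]
# Policy
allow_resize_to_same_host=True
"""
print(resize_to_same_host_enabled(sample))
```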

6.6 GPU passthrough

GPU passthrough for SUSE OpenStack Cloud provides the nova instance direct access to the GPU device for increased performance.

This section demonstrates the steps to pass through an Nvidia GPU card supported by SUSE OpenStack Cloud.

Note

Resizing the VM to the same host with the same PCI card is not supported with PCI passthrough.

The following steps are necessary to leverage PCI passthrough on a SUSE OpenStack Cloud 9 Compute Node: preparing the Compute Node, preparing nova via the input model updates and glance. Follow the procedures below in sequence:

Procedure 6.1: Preparing the Compute Node
  1. There should be no kernel drivers or binaries with direct access to the PCI device. If there are kernel modules, ensure they are blacklisted.

    For example, it is common to have a nouveau driver from when the node was installed. This driver is a graphics driver for Nvidia-based GPUs. It must be blacklisted as shown in this example:

    root # echo 'blacklist nouveau' >> /etc/modprobe.d/nouveau-default.conf

    The file location and its contents are important; the name of the file is your choice. Other drivers can be blacklisted in the same manner, including Nvidia drivers.

  2. On the host, IOMMU support (iommu_groups) is required and may already be enabled. To check if IOMMU is enabled, run the following command:

    root #  virt-host-validate
            .....
            QEMU: Checking if IOMMU is enabled by kernel
            : WARN (IOMMU appears to be disabled in kernel. Add intel_iommu=on to kernel cmdline arguments)
            .....

    To modify the kernel command line as suggested in the warning, edit /etc/default/grub and append intel_iommu=on to the GRUB_CMDLINE_LINUX_DEFAULT variable. Run:

    root #  update-bootloader

    Reboot to enable iommu_groups.

  3. After the reboot, check that IOMMU is enabled:

    root # virt-host-validate
            .....
            QEMU: Checking if IOMMU is enabled by kernel
            : PASS
            .....
  4. Confirm IOMMU groups are available by finding the group associated with your PCI device (for example Nvidia GPU):

    ardana > lspci -nn | grep -i nvidia
            84:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)

    In this example, 84:00.0 is the address of the PCI device. The vendor ID is 10de. The product ID is 1db4.

  5. Confirm that the devices are available for passthrough:

    ardana > ls -ld /sys/kernel/iommu_groups/*/devices/*84:00.?/
            drwxr-xr-x 3 root root 0 Nov 19 17:00 /sys/kernel/iommu_groups/56/devices/0000:84:00.0/
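
The values needed later in the input model (vendor_id, product_id, bus_address) can be read from the lspci -nn line shown in step 4. The following is a minimal sketch, assuming the PCI domain is 0000 and a line format like the example above. parse_lspci_line is a hypothetical helper, not a tool shipped with the product.

```python
import re

# Matches one line of `lspci -nn` output: the short PCI address at the
# start and the last [vendor:product] pair on the line.
LSPCI_LINE = re.compile(
    r"^(?P<address>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r".*\[(?P<vendor_id>[0-9a-f]{4}):(?P<product_id>[0-9a-f]{4})\]",
    re.IGNORECASE,
)


def parse_lspci_line(line):
    """Return address, vendor_id, product_id and bus_address, or None."""
    match = LSPCI_LINE.search(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    # Assumption: the device sits in PCI domain 0000, as in the
    # /sys/kernel/iommu_groups path shown in step 5.
    info["bus_address"] = "0000:" + info["address"]
    return info


example = ("84:00.0 3D controller [0302]: NVIDIA Corporation GV100GL "
           "[Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)")
print(parse_lspci_line(example))
```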

6.6.1 Preparing nova via the input model updates

To implement the required configuration, log into the Cloud Lifecycle Manager node and update the Cloud Lifecycle Manager model files to enable GPU passthrough for compute nodes.

Edit servers.yml

Add the pass-through section after the servers section in the servers.yml file. The following example shows only the relevant sections:

        ---
        product:
          version: 2

        baremetal:
          netmask: 255.255.255.0
          subnet: 192.168.100.0


        servers:
        .
        .
        .
        .

          - id: compute-0001
            ip-addr: 192.168.75.5
            role: COMPUTE-ROLE
            server-group: RACK3
            nic-mapping: HP-DL360-4PORT
            ilo-ip: ****
            ilo-user: ****
            ilo-password: ****
            mac-addr: ****
          .
          .
          .

          - id: compute-0008
            ip-addr: 192.168.75.7
            role: COMPUTE-ROLE
            server-group: RACK2
            nic-mapping: HP-DL360-4PORT
            ilo-ip: ****
            ilo-user: ****
            ilo-password: ****
            mac-addr: ****

        pass-through:
          servers:
            - id: compute-0001
              data:
                gpu:
                  - vendor_id: 10de
                    product_id: 1db4
                    bus_address: 0000:84:00.0
                    pf_mode: type-PCI
                    name: a1
                  - vendor_id: 10de
                    product_id: 1db4
                    bus_address: 0000:85:00.0
                    pf_mode: type-PCI
                    name: b1
            - id: compute-0008
              data:
                gpu:
                  - vendor_id: 10de
                    product_id: 1db4
                    pf_mode: type-PCI
                    name: c1
  1. Check out the site branch of the local git repository and change to the correct directory:

    ardana > cd ~/openstack
    ardana > git checkout site
    ardana > cd ~/openstack/my_cloud/definition/data/
  2. Open the file containing the servers list, for example servers.yml, with your chosen editor. Save the changes to the file and commit to the local git repository:

    ardana > git add -A

    Confirm that the changes to the tree are relevant changes and commit:

    ardana > git status
    ardana > git commit -m "your commit message goes here in quotes"
  3. Enable your changes by running the necessary playbooks:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible

    If you are enabling GPU passthrough for your compute nodes during your initial installation, run the following command:

    ardana > ansible-playbook -i hosts/verb_hosts site.yml

    If you are enabling GPU passthrough for your compute nodes post-installation, run the following command:

    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml

The above procedure updates the configuration for the nova api, nova compute and scheduler as defined in https://docs.openstack.org/nova/rocky/admin/pci-passthrough.html.

The following is the PCI configuration for the compute-0001 node after the playbook run, using the example above:

        [pci]
        passthrough_whitelist = [{"address": "0000:84:00.0"}, {"address": "0000:85:00.0"}]
        alias = {"vendor_id": "10de", "name": "a1", "device_type": "type-PCI", "product_id": "1db4"}
        alias = {"vendor_id": "10de", "name": "b1", "device_type": "type-PCI", "product_id": "1db4"}

The following is the PCI configuration for the compute-0008 node after the playbook run, using the example above:

        [pci]
        passthrough_whitelist = [{"vendor_id": "10de", "product_id": "1db4"}]
        alias = {"vendor_id": "10de", "name": "c1", "device_type": "type-PCI", "product_id": "1db4"}
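
The two [pci] sections above suggest how the pass-through data maps to nova settings: entries with a bus_address are whitelisted by address, entries without one are whitelisted by vendor and product ID, and each entry yields one alias. The sketch below reproduces only those two cases. It illustrates the mapping and is not the playbook's actual implementation; in particular, the handling of a server that mixes both kinds of entries is an assumption.

```python
import json

# gpu entries copied from the pass-through example in servers.yml.
COMPUTE_0001 = [
    {"vendor_id": "10de", "product_id": "1db4",
     "bus_address": "0000:84:00.0", "pf_mode": "type-PCI", "name": "a1"},
    {"vendor_id": "10de", "product_id": "1db4",
     "bus_address": "0000:85:00.0", "pf_mode": "type-PCI", "name": "b1"},
]
COMPUTE_0008 = [
    {"vendor_id": "10de", "product_id": "1db4",
     "pf_mode": "type-PCI", "name": "c1"},
]


def pci_section(gpus):
    """Render a [pci] section for the gpu entries of one server."""
    if all("bus_address" in gpu for gpu in gpus):
        whitelist = [{"address": gpu["bus_address"]} for gpu in gpus]
    else:
        # Assumption: without addresses for every entry, fall back to
        # vendor/product matching and drop duplicate pairs.
        whitelist = []
        for gpu in gpus:
            entry = {"vendor_id": gpu["vendor_id"],
                     "product_id": gpu["product_id"]}
            if entry not in whitelist:
                whitelist.append(entry)
    lines = ["[pci]", "passthrough_whitelist = " + json.dumps(whitelist)]
    for gpu in gpus:
        alias = {"vendor_id": gpu["vendor_id"], "name": gpu["name"],
                 "device_type": gpu["pf_mode"],
                 "product_id": gpu["product_id"]}
        lines.append("alias = " + json.dumps(alias))
    return "\n".join(lines)


print(pci_section(COMPUTE_0001))
print(pci_section(COMPUTE_0008))
```
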
Note

After running the site.yml playbook above, reboot the compute nodes that are configured with Intel PCI devices.

6.6.2 Create a flavor

For GPU passthrough, set the pci_passthrough:alias property. You can do so for an existing flavor or create a new flavor as shown in the example below:

        ardana > openstack flavor create --ram 8192 --disk 100 --vcpus 8 gpuflavor
        ardana > openstack flavor set gpuflavor --property "pci_passthrough:alias"="a1:1"

Here the a1 references the alias name as provided in the model while the 1 tells nova that a single GPU should be assigned.

Boot an instance using the flavor created above:

         ardana > openstack server create --flavor gpuflavor --image sles12sp4 --key-name key --nic net-id=$net_id gpu-instance-1
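
A mistyped alias name in the flavor property is easy to miss. As a minimal sketch, the ALIAS:COUNT value can be checked against the alias names defined in the input model before the flavor is set. The helper parse_alias_spec is hypothetical and handles a single ALIAS:COUNT pair only.

```python
def parse_alias_spec(spec, known_aliases=None):
    """Split a value such as "a1:1" into (alias name, device count).

    known_aliases is an optional collection of alias names taken from
    the pass-through section of the input model.
    """
    name, sep, count = spec.rpartition(":")
    if not sep or not name or not count.isdigit() or int(count) < 1:
        raise ValueError("expected ALIAS:COUNT, got %r" % (spec,))
    if known_aliases is not None and name not in known_aliases:
        raise ValueError("alias %r is not defined in the input model"
                         % (name,))
    return name, int(count)


# Alias names from the servers.yml example in this section.
print(parse_alias_spec("a1:1", {"a1", "b1", "c1"}))
```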

6.7 Configuring the Image Service

The Image service, based on OpenStack glance, works out of the box and does not need any special configuration. However, the optional features detailed below, glance image caching and the glance copy-from feature, require some additional configuration if you choose to use them.

Warning

glance images are assigned IDs upon creation, either automatically or specified by the user. The ID of an image should be unique, so if a user assigns an ID which already exists, a conflict (409) will occur.

This only becomes a problem if users can publicize or share images with others. If users cannot share images AND cannot publicize images, your system is not vulnerable. If the system has also been purged (via glance-manage db purge), it is possible for deleted image IDs to be reused.

If deleted image IDs can be reused then recycling of public and shared images becomes a possibility. This means that a new (or modified) image can replace an old image, which could be malicious.

If this is a problem for you, please contact Sales Engineering.

6.7.1 How to enable glance image caching

In SUSE OpenStack Cloud 9, by default, the glance image caching option is not enabled. You have the option to have image caching enabled and these steps will show you how to do that.

The main benefits of image caching are that the glance service can return images faster and that other services carry less load when supplying the image.

In order to use the image caching option you will need to supply a logical volume for the service to use for the caching.

To use the glance image caching option, locate the glance cache section (shown in the example below) in your ~/openstack/my_cloud/definition/data/disks_controller.yml file and specify the mount point for the logical volume you wish to use for it.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit your ~/openstack/my_cloud/definition/data/disks_controller.yml file and specify the volume and mount point for your glance-cache. Here is an example:

    # glance cache: if a logical volume with consumer usage glance-cache
    # is defined glance caching will be enabled. The logical volume can be
    # part of an existing volume group or a dedicated volume group.
     - name: glance-vg
       physical-volumes:
         - /dev/sdx
       logical-volumes:
         - name: glance-cache
           size: 95%
           mount: /var/lib/glance/cache
           fstype: ext4
           mkfs-opts: -O large_file
           consumer:
             name: glance-api
             usage: glance-cache

    If you are enabling image caching during your initial installation, prior to running site.yml the first time, then continue with the installation steps. However, if you are making this change post-installation then you will need to commit your changes with the steps below.

  3. Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the glance reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml
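
According to the comment in the example above, caching is enabled when a logical volume with consumer usage glance-cache is defined. The following sketch checks a disk model for such a volume. It assumes the volume groups have already been loaded into Python dictionaries (for example, with a YAML library); glance_cache_volumes is a hypothetical helper.

```python
def glance_cache_volumes(volume_groups):
    """Return (volume group, logical volume, mount) for each logical
    volume whose consumer usage is glance-cache."""
    found = []
    for vg in volume_groups:
        for lv in vg.get("logical-volumes", []):
            consumer = lv.get("consumer", {})
            if consumer.get("usage") == "glance-cache":
                found.append((vg["name"], lv["name"], lv.get("mount")))
    return found


# The volume group from the disks_controller.yml example, as a dict.
EXAMPLE = [{
    "name": "glance-vg",
    "physical-volumes": ["/dev/sdx"],
    "logical-volumes": [{
        "name": "glance-cache",
        "size": "95%",
        "mount": "/var/lib/glance/cache",
        "fstype": "ext4",
        "mkfs-opts": "-O large_file",
        "consumer": {"name": "glance-api", "usage": "glance-cache"},
    }],
}]

print(glance_cache_volumes(EXAMPLE))
```

An empty result means no cache volume is defined, so glance caching would stay disabled.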

An existing volume image cache is not properly deleted when cinder detects the source image has changed. After updating any source image, delete the cache volume so that the cache is refreshed.

The volume image cache must be deleted before trying to use the associated source image in any other volume operations. This includes creating bootable volumes or booting an instance with create volume enabled and the updated image as the source image.

6.7.2 Allowing the glance copy-from option in your environment

When creating images, one of the options you have is to copy the image from a remote location to your local glance store. You do this by specifying the --copy-from option when creating the image. To use this feature though you need to ensure the following conditions are met:

  • The server hosting the glance service must have network access to the remote location that is hosting the image.

  • There cannot be a proxy between glance and the remote location.

  • The glance v1 API must be enabled, as v2 does not currently support the copy-from function.

  • The http glance store must be enabled in the environment, following the steps below.

Enabling the HTTP glance Store

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/glance/glance-api.conf.j2 file and add http to the list of glance stores in the [glance_store] section as shown below:

    [glance_store]
    stores = {{ glance_stores }}, http
  3. Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the glance reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml
  7. Run the horizon reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml

7 Managing ESX

Information about managing and configuring the ESX service.

7.1 Networking for ESXi Hypervisor (OVSvApp)

To provide the network as a service for tenant VMs hosted on an ESXi Hypervisor, a service VM called the OVSvApp VM is deployed on each ESXi Hypervisor within a cluster managed by OpenStack nova, as shown in the following figure.

Image

The OVSvApp VM runs SLES as a guest operating system, and has Open vSwitch 2.1.0 or above installed. It also runs an agent called the OVSvApp agent, which is responsible for dynamically creating the port groups for the tenant VMs and for managing the OVS bridges, which contain the flows related to security groups and L2 networking.

To facilitate fault tolerance and mitigation of data path loss for tenant VMs, the neutron-ovsvapp-agent-monitor process runs as part of the neutron-ovsvapp-agent service and monitors the Open vSwitch module within the OVSvApp VM. It also uses an nginx server to provide the health status of the Open vSwitch module to the neutron server for mitigation actions. A systemd script keeps the neutron-ovsvapp-agent service alive.

When an OVSvApp Service VM crashes, an agent monitoring mechanism starts a cluster mitigation process. You can mitigate data path traffic loss for VMs on the failed ESX host in that cluster by putting the failed ESX host in maintenance mode. This, in turn, triggers vCenter DRS to migrate tenant VMs to other ESX hosts within the same cluster, which ensures data path continuity of tenant VM traffic.

View Cluster Mitigation

Important

Install python-networking-vsphere so that neutron ovsvapp commands will work properly.

ardana > sudo zypper in python-networking-vsphere

An administrator can view cluster mitigation status using the following commands.

  • neutron ovsvapp-mitigated-cluster-list

    Lists all the clusters where at least one round of host mitigation has happened.

    Example:

    ardana > neutron ovsvapp-mitigated-cluster-list
    +----------------+--------------+-----------------------+---------------------------+
    | vcenter_id     | cluster_id   | being_mitigated       | threshold_reached         |
    +----------------+--------------+-----------------------+---------------------------+
    | vcenter1       | cluster1     | True                  | False                     |
    | vcenter2       | cluster2     | False                 | True                      |
    +----------------+--------------+-----------------------+---------------------------+
  • neutron ovsvapp-mitigated-cluster-show --vcenter-id <VCENTER_ID> --cluster-id <CLUSTER_ID>

    Shows the status of a particular cluster.

    Example:

    ardana > neutron ovsvapp-mitigated-cluster-show --vcenter-id vcenter1 --cluster-id cluster1
    +---------------------------+-------------+
    | Field                     | Value       |
    +---------------------------+-------------+
    | being_mitigated           | True        |
    | cluster_id                | cluster1    |
    | threshold_reached         | False       |
    | vcenter_id                | vcenter1    |
    +---------------------------+-------------+

    There can be instances where a triggered mitigation does not succeed and the neutron server is not informed of the failure (for example, if the agent selected to mitigate the host goes down before finishing the task). In this case, the cluster will be locked. To unlock the cluster for further mitigations, use the update command.

  • neutron ovsvapp-mitigated-cluster-update --vcenter-id <VCENTER_ID> --cluster-id <CLUSTER_ID>

    • Update the status of a mitigated cluster:

      Change the value of being-mitigated from True to False to unlock the cluster.

      Example:

      ardana > neutron ovsvapp-mitigated-cluster-update --vcenter-id vcenter1 --cluster-id cluster1 --being-mitigated False
    • Update the threshold value:

      Update the threshold-reached value to True if no further migration is required in the selected cluster.

      Example:

      ardana > neutron ovsvapp-mitigated-cluster-update --vcenter-id vcenter1 --cluster-id cluster1 --being-mitigated False --threshold-reached True

    REST API

    • ardana > curl -i -X GET http://<ip>:9696/v2.0/ovsvapp_mitigated_clusters \
        -H "User-Agent: python-neutronclient" -H "Accept: application/json" -H \
        "X-Auth-Token: <token_id>"

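The REST call above can also be issued from a script. The following is a minimal sketch using only the Python standard library. The URL path, port, and headers are taken from the curl example; the function names are hypothetical, and the structure of the JSON response is not described here, so the body is returned unchanged.

```python
import json
import urllib.request


def mitigated_clusters_request(neutron_host, token):
    """Build the GET request from the curl example (nothing is sent)."""
    url = "http://%s:9696/v2.0/ovsvapp_mitigated_clusters" % neutron_host
    headers = {
        "User-Agent": "python-neutronclient",
        "Accept": "application/json",
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_mitigated_clusters(neutron_host, token, timeout=10):
    """Send the request and return the decoded JSON body as-is."""
    request = mitigated_clusters_request(neutron_host, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# 192.0.2.10 and the token are placeholders for <ip> and <token_id>.
req = mitigated_clusters_request("192.0.2.10", "TOKEN")
print(req.get_method(), req.full_url)
```
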
7.1.1 More Information

For more information on the Networking for ESXi Hypervisor (OVSvApp), see the following references:

7.2 Validating the neutron Installation

You can validate that the ESX compute cluster is added to the cloud successfully using the following command:

# openstack network agent list

+------------------+----------------------+-----------------------+-------------------+-------+----------------+---------------------------+
| id               | agent_type           | host                  | availability_zone | alive | admin_state_up | binary                    |
+------------------+----------------------+-----------------------+-------------------+-------+----------------+---------------------------+
| 05ca6ef...999c09 | L3 agent             | doc-cp1-comp0001-mgmt | nova              | :-)   | True           | neutron-l3-agent          |
| 3b9179a...28e2ef | Metadata agent       | doc-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-metadata-agent    |
| 4e8f84f...c9c58f | Metadata agent       | doc-cp1-comp0002-mgmt |                   | :-)   | True           | neutron-metadata-agent    |
| 55a5791...c17451 | L3 agent             | doc-cp1-c1-m1-mgmt    | nova              | :-)   | True           | neutron-vpn-agent         |
| 5e3db8f...87f9be | Open vSwitch agent   | doc-cp1-c1-m1-mgmt    |                   | :-)   | True           | neutron-openvswitch-agent |
| 6968d9a...b7b4e9 | L3 agent             | doc-cp1-c1-m2-mgmt    | nova              | :-)   | True           | neutron-vpn-agent         |
| 7b02b20...53a187 | Metadata agent       | doc-cp1-c1-m2-mgmt    |                   | :-)   | True           | neutron-metadata-agent    |
| 8ece188...5c3703 | Open vSwitch agent   | doc-cp1-comp0002-mgmt |                   | :-)   | True           | neutron-openvswitch-agent |
| 8fcb3c7...65119a | Metadata agent       | doc-cp1-c1-m1-mgmt    |                   | :-)   | True           | neutron-metadata-agent    |
| 9f48967...36effe | OVSvApp agent        | doc-cp1-comp0002-mgmt |                   | :-)   | True           | ovsvapp-agent             |
| a2a0b78...026da9 | Open vSwitch agent   | doc-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-openvswitch-agent |
| a2fbd4a...28a1ac | DHCP agent           | doc-cp1-c1-m2-mgmt    | nova              | :-)   | True           | neutron-dhcp-agent        |
| b2428d5...ee60b2 | DHCP agent           | doc-cp1-c1-m1-mgmt    | nova              | :-)   | True           | neutron-dhcp-agent        |
| c0983a6...411524 | Open vSwitch agent   | doc-cp1-c1-m2-mgmt    |                   | :-)   | True           | neutron-openvswitch-agent |
| c32778b...a0fc75 | L3 agent             | doc-cp1-comp0002-mgmt | nova              | :-)   | True           | neutron-l3-agent          |
+------------------+----------------------+-----------------------+-------------------+-------+----------------+---------------------------+
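
With many agents in the list, it can help to filter the table for the OVSvApp agent rows. The sketch below parses table output like the one above and reports OVSvApp agents that are not marked alive. It assumes the column names shown above and treats any alive value other than :-) as not alive; both helper names are hypothetical.

```python
def parse_table(text):
    """Parse an ASCII table as printed by the CLI into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip borders, prompts and blank lines
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def dead_agents(rows, agent_type="OVSvApp agent"):
    """Return the hosts whose agent of the given type is not alive."""
    return [row["host"] for row in rows
            if row["agent_type"] == agent_type and row["alive"] != ":-)"]


SAMPLE = """
+----+---------------+-------+-------+
| id | agent_type    | host  | alive |
+----+---------------+-------+-------+
| 1  | OVSvApp agent | esx-a | :-)   |
| 2  | OVSvApp agent | esx-b | XXX   |
| 3  | DHCP agent    | ctl-1 | XXX   |
+----+---------------+-------+-------+
"""
print(dead_agents(parse_table(SAMPLE)))
```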

7.3 Removing a Cluster from the Compute Resource Pool

7.3.1 Prerequisites

Write down the hostname and ESXi configuration IP addresses of the OVSvApp VMs of that ESX cluster before deleting the VMs. These IP addresses and hostnames will be used to clean up monasca alarm definitions.

Perform the following steps:

  1. Log in to the vSphere client.

  2. Select the ovsvapp node running on each ESXi host and click the Summary tab as shown in the following example.

    Image

    Similarly you can retrieve the compute-proxy node information.

    Image

7.3.2 Removing an existing cluster from the compute resource pool

Perform the following steps to remove an existing cluster from the compute resource pool.

  1. Run the following command to check for the instances launched in that cluster:

    # openstack server list --host <hostname>
    +--------------------------------------+------+--------+------------+-------------+------------------+
    | ID                                   | Name | Status | Task State | Power State | Networks         |
    +--------------------------------------+------+--------+------------+-------------+------------------+
    | 80e54965-758b-425e-901b-9ea756576331 | VM1  | ACTIVE | -          | Running     | private=10.0.0.2 |
    +--------------------------------------+------+--------+------------+-------------+------------------+

    where:

    • hostname: Specifies hostname of the compute proxy present in that cluster.

  2. Delete all instances spawned in that cluster:

    # openstack server delete <server> [<server ...>]

    where:

    • server: Specifies the name or ID of the server(s).

    OR

    Migrate all instances spawned in that cluster.

    # openstack server migrate <server>
  3. Run the following playbooks to stop the Compute (nova) and Networking (neutron) services:

    ardana > ansible-playbook -i hosts/verb_hosts nova-stop.yml --limit <hostname>
    ardana > ansible-playbook -i hosts/verb_hosts neutron-stop.yml --limit <hostname>

    where:

    • hostname: Specifies hostname of the compute proxy present in that cluster.

7.3.3 Cleanup monasca-agent for OVSvAPP Service

Perform the following procedure to clean up the monasca agents for the ovsvapp-agent service.

  1. If the monasca API is installed on a different node, copy service.osrc from the Cloud Lifecycle Manager to the monasca API server.

    scp service.osrc $USER@ardana-cp1-mtrmon-m1-mgmt:
  2. SSH to the monasca API server. You must SSH to each monasca API server for cleanup.

    For example:

    ssh ardana-cp1-mtrmon-m1-mgmt
  3. Edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove the reference to the OVSvApp VM you removed. This requires sudo access.

    sudo vi /etc/monasca/agent/conf.d/host_alive.yaml

    A sample of host_alive.yaml:

    - alive_test: ping
      built_by: HostAlive
      host_name: esx-cp1-esx-ovsvapp0001-mgmt
      name: esx-cp1-esx-ovsvapp0001-mgmt ping
      target_hostname: esx-cp1-esx-ovsvapp0001-mgmt

    where host_name and target_hostname are the values shown in the DNS name field in the vSphere client. (Refer to Section 7.3.1, “Prerequisites”).

  4. After removing the reference on each of the monasca API servers, restart the monasca-agent on each of those servers by executing the following command.

    tux > sudo service openstack-monasca-agent restart
  5. With the OVSvAPP references removed and the monasca-agent restarted, you can delete the corresponding alarm to complete the cleanup process. We recommend using the monasca CLI which is installed on each of your monasca API servers by default. Execute the following command from the monasca API server (for example: ardana-cp1-mtrmon-mX-mgmt).

    monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=<ovsvapp deleted>

    For example, if the OVSvApp appears as in the preceding example, run the following command to get the alarm ID:

    monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
    +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
    | id                                   | alarm_definition_id                  | alarm_definition_name | metric_name       | metric_dimensions                         | severity | state | lifecycle_state | link | state_updated_timestamp  | updated_timestamp        | created_timestamp        |
    +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
    | cfc6bfa4-2485-4319-b1e5-0107886f4270 | cca96c53-a927-4b0a-9bf3-cb21d28216f3 | Host Status           | host_alive_status | service: system                           | HIGH     | OK    | None            | None | 2016-10-27T06:33:04.256Z | 2016-10-27T06:33:04.256Z | 2016-10-23T13:41:57.258Z |
    |                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m1-mgmt  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       | host_alive_status | service: system                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m3-mgmt  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       | host_alive_status | service: system                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m2-mgmt  |          |       |                 |      |                          |                          |                          |
    +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
  6. Delete the monasca alarm.

    monasca alarm-delete <alarm ID>

    For example:

    monasca alarm-delete cfc6bfa4-2485-4319-b1e5-0107886f4270
    Successfully deleted alarm

    After you delete the alarms and update the monasca-agent configuration, those alarms are removed from the Operations Console UI. You can log in to the Operations Console to verify the status.

7.3.4 Removing the Compute Proxy from Monitoring

After you remove the Compute proxy, the alarms defined against it continue to trigger. To resolve this, perform the following steps.

  1. SSH to the monasca API server. You must repeat the cleanup on each monasca API server.

    For example:

    ssh ardana-cp1-mtrmon-m1-mgmt
  2. Edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove the references to the Compute proxy you removed. This requires sudo access.

    sudo vi /etc/monasca/agent/conf.d/host_alive.yaml

    A sample entry from the host_alive.yaml file:

    - alive_test: ping
      built_by: HostAlive
      host_name: MCP-VCP-cpesx-esx-comp0001-mgmt
      name: MCP-VCP-cpesx-esx-comp0001-mgmt ping
  3. Once you have removed the references on each of your monasca API servers, execute the following command to restart the monasca-agent on each of those servers.

    tux > sudo service openstack-monasca-agent restart
  4. With the Compute proxy references removed and the monasca-agent restarted, delete the corresponding alarms to complete the cleanup process. We recommend using the monasca CLI, which is installed on each of your monasca API servers by default. First, list the alarms for the removed Compute proxy to get their alarm IDs.

    monasca alarm-list --metric-dimensions hostname=<compute node deleted>

    For example, if the Compute proxy hostname is ardana-cp1-comp0001-mgmt, execute the following command to get the alarm ID:

    monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt
  5. Delete the monasca alarm.

    monasca alarm-delete <alarm ID>
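Before restarting the agent in step 3, it can help to confirm that no entry for the removed Compute proxy is left in host_alive.yaml. The following is a minimal sketch, not part of the product tooling; the function name host_alive_refs is hypothetical.

```shell
# host_alive_refs FILE HOSTNAME
# Print every line of FILE that still mentions HOSTNAME, with line numbers.
# Returns non-zero when nothing matches, which means the cleanup is complete.
host_alive_refs() {
    grep -n -F -- "$2" "$1"
}

# Usage on a monasca API server (reading the file may require sudo):
#   host_alive_refs /etc/monasca/agent/conf.d/host_alive.yaml MCP-VCP-cpesx-esx-comp0001-mgmt
```

The match is a fixed-string search, so dots and dashes in the hostname are not interpreted as regular expression characters.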

7.3.5 Cleaning the monasca Alarms Related to ESX Proxy and vCenter Cluster

Perform the following procedure:

  1. Using the ESX proxy hostname, execute the following command to list all alarms.

    monasca alarm-list --metric-dimensions hostname=COMPUTE_NODE_DELETED

    where COMPUTE_NODE_DELETED is the hostname taken from the vSphere client (refer to Section 7.3.1, “Prerequisites”).

    Note

    Make a note of all the alarm IDs that are displayed after executing the preceding command.

    For example, if the Compute proxy hostname is MCP-VCP-cpesx-esx-comp0001-mgmt:

    monasca alarm-list --metric-dimensions hostname=MCP-VCP-cpesx-esx-comp0001-mgmt
    ardana@R28N6340-701-cp1-c1-m1-mgmt:~$ monasca alarm-list --metric-dimensions hostname=R28N6340-701-cp1-esx-comp0001-mgmt
    +--------------------------------------+--------------------------------------+------------------------+------------------------+--------------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
    | id                                   | alarm_definition_id                  | alarm_definition_name  | metric_name            | metric_dimensions                                | severity | state | lifecycle_state | link | state_updated_timestamp  | updated_timestamp        | created_timestamp        |
    +--------------------------------------+--------------------------------------+------------------------+------------------------+--------------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
    | 02342bcb-da81-40db-a262-09539523c482 | 3e302297-0a36-4f0e-a1bd-03402b937a4e | HTTP Status            | http_status            | service: compute                                 | HIGH     | OK    | None            | None | 2016-11-11T06:58:11.717Z | 2016-11-11T06:58:11.717Z | 2016-11-10T08:55:45.136Z |
    |                                      |                                      |                        |                        | cloud_name: entry-scale-esx-kvm                  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | url: https://10.244.209.9:8774                   |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | hostname: R28N6340-701-cp1-esx-comp0001-mgmt     |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | component: nova-api                              |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | control_plane: control-plane-1                   |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | cluster: esx-compute                             |          |       |                 |      |                          |                          |                          |
    | 04cb36ce-0c7c-4b4c-9ebc-c4011e2f6c0a | 15c593de-fa54-4803-bd71-afab95b980a4 | Disk Usage             | disk.space_used_perc   | mount_point: /proc/sys/fs/binfmt_misc            | HIGH     | OK    | None            | None | 2016-11-10T08:52:52.886Z | 2016-11-10T08:52:52.886Z | 2016-11-10T08:51:29.197Z |
    |                                      |                                      |                        |                        | service: system                                  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | cloud_name: entry-scale-esx-kvm                  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | hostname: R28N6340-701-cp1-esx-comp0001-mgmt     |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | control_plane: control-plane-1                   |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | cluster: esx-compute                             |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                        |                        | device: systemd-1                                |          |       |                 |      |                          |                          |                          |
    +--------------------------------------+--------------------------------------+------------------------+------------------------+--------------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
  2. Delete the alarm using the alarm IDs.

    monasca alarm-delete <alarm ID>

    Perform this step for all alarm IDs listed from the preceding step (Step 1).

    For example:

    monasca alarm-delete 1cc219b1-ce4d-476b-80c2-0cafa53e1a12
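Steps 1 and 2 above can be combined so that the alarm IDs do not have to be copied by hand. The following is a minimal sketch that assumes the table layout shown in the sample output, with the alarm ID in the first column; the helper names alarm_ids and delete_alarms_for_host are hypothetical.

```shell
# alarm_ids: read `monasca alarm-list` table output on stdin and print one
# alarm ID per line. Only rows whose first column holds a UUID-like value are
# kept, so borders, the header, and continuation rows are skipped.
alarm_ids() {
    awk -F'|' '
        {
            id = $2
            gsub(/ /, "", id)
            if (id ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/) print id
        }'
}

# delete_alarms_for_host HOSTNAME: delete every alarm whose metric dimensions
# include hostname=HOSTNAME, using only the two commands shown above.
delete_alarms_for_host() {
    monasca alarm-list --metric-dimensions "hostname=$1" | alarm_ids |
    while read -r id; do
        monasca alarm-delete "$id"
    done
}
```

Running `monasca alarm-list ... | alarm_ids` on its own first is a reasonable way to review the IDs before anything is deleted.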

7.4 Removing an ESXi Host from a Cluster

This topic describes how to remove an existing ESXi host from a cluster and clean up the services for the OVSvAPP VM.

Note

Before performing this procedure, wait until vCenter migrates all the tenant VMs to other active hosts in the same cluster.

7.4.1 Prerequisite

Write down the hostnames and ESXi configuration IP addresses of the OVSvAPP VMs of that ESX cluster before deleting the VMs. These IP addresses and hostnames are used to clean up monasca alarm definitions.

  1. Log in to the vSphere client.

  2. Select the ovsvapp node running on the ESXi host and click the Summary tab.

7.4.2 Procedure

  1. Right-click the host and put it in maintenance mode. This automatically migrates all the tenant VMs except OVSvApp.

  2. Cancel the maintenance mode task.

  3. Right-click the ovsvapp VM (IP Address) node, select Power, and then click Power Off.

  4. Right-click the node and then click Delete from Disk.

  5. Right-click the Host, and then click Enter Maintenance Mode.

  6. Disconnect the VM. Right-click the VM, and then click Disconnect.

The ESXi node is removed from the vCenter.

7.4.3 Clean up neutron-agent for OVSvAPP Service

After removing the ESXi node from vCenter, perform the following procedure to clean up the neutron agents for the ovsvapp-agent service.

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the credentials.

    ardana > source service.osrc
  3. Execute the following command.

    ardana > openstack network agent list | grep <OVSvapp hostname>

    For example:

    ardana > openstack network agent list | grep MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
    | 92ca8ada-d89b-43f9-b941-3e0cd2b51e49 | OVSvApp Agent      | MCP-VCP-cpesx-esx-ovsvapp0001-mgmt |                   | :-)   | True           | ovsvapp-agent             |
  4. Delete the OVSvAPP agent.

    ardana > openstack network agent delete <Agent ID>

    For example:

    ardana > openstack network agent delete 92ca8ada-d89b-43f9-b941-3e0cd2b51e49

If you have more than one host, perform the preceding procedure for all the hosts.
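Because grep matches substrings, a hostname such as ovsvapp0001 can also match longer names. The following minimal sketch selects the agent ID by an exact match on the Host column instead. It assumes the table layout of the sample row above (ID in the first column, Host in the third); the function name agent_ids_for_host is hypothetical.

```shell
# agent_ids_for_host HOST: read `openstack network agent list` table output
# on stdin and print the ID (first column) of each row whose Host column
# (third column) equals HOST exactly.
agent_ids_for_host() {
    awk -F'|' -v host="$1" '
        {
            id = $2; h = $4
            gsub(/^ +| +$/, "", id)
            gsub(/^ +| +$/, "", h)
            if (h == host && id != "") print id
        }'
}

# Usage:
#   openstack network agent list | agent_ids_for_host MCP-VCP-cpesx-esx-ovsvapp0001-mgmt |
#   while read -r id; do openstack network agent delete "$id"; done
```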

7.4.4 Clean up monasca-agent for OVSvAPP Service

Perform the following procedure to clean up the monasca agents for the ovsvapp-agent service.

  1. If monasca-API is installed on a different node, copy the service.osrc file from the Cloud Lifecycle Manager to the monasca API server.

    ardana > scp service.osrc $USER@ardana-cp1-mtrmon-m1-mgmt:
  2. SSH to the monasca API server. You must repeat the cleanup on each monasca API server.

    For example:

    ardana > ssh ardana-cp1-mtrmon-m1-mgmt
  3. Edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove the references to the OVSvAPP VM you removed. This requires sudo access.

    sudo vi /etc/monasca/agent/conf.d/host_alive.yaml

    A sample of host_alive.yaml:

    - alive_test: ping
      built_by: HostAlive
      host_name: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
      name: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt ping
      target_hostname: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt

    where host_name and target_hostname are the values shown in the DNS Name field in the vSphere client (refer to Section 7.4.1, “Prerequisite”).

  4. After removing the reference on each of the monasca API servers, restart the monasca-agent on each of those servers by executing the following command.

    tux > sudo service openstack-monasca-agent restart
  5. With the OVSvAPP references removed and the monasca-agent restarted, you can delete the corresponding alarm to complete the cleanup process. We recommend using the monasca CLI which is installed on each of your monasca API servers by default. Execute the following command from the monasca API server (for example: ardana-cp1-mtrmon-mX-mgmt).

    ardana > monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=<ovsvapp deleted>

    For example, if the OVSvAPP hostname is MCP-VCP-cpesx-esx-ovsvapp0001-mgmt, execute the following command to get the alarm ID:

    ardana > monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
    +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
    | id                                   | alarm_definition_id                  | alarm_definition_name | metric_name       | metric_dimensions                         | severity | state | lifecycle_state | link | state_updated_timestamp  | updated_timestamp        | created_timestamp        |
    +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
    | cfc6bfa4-2485-4319-b1e5-0107886f4270 | cca96c53-a927-4b0a-9bf3-cb21d28216f3 | Host Status           | host_alive_status | service: system                           | HIGH     | OK    | None            | None | 2016-10-27T06:33:04.256Z | 2016-10-27T06:33:04.256Z | 2016-10-23T13:41:57.258Z |
    |                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m1-mgmt  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       | host_alive_status | service: system                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m3-mgmt  |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       | host_alive_status | service: system                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
    |                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m2-mgmt  |          |       |                 |      |                          |                          |                          |
    +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
  6. Delete the monasca alarm.

    ardana > monasca alarm-delete <alarm ID>

    For example:

    ardana > monasca alarm-delete cfc6bfa4-2485-4319-b1e5-0107886f4270
    Successfully deleted alarm

    After you delete the alarms and update the monasca-agent configuration, those alarms are removed from the Operations Console UI. You can log in to the Operations Console to verify the status.
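The manual edit of host_alive.yaml in step 3 can also be expressed as a filter. The following is a minimal sketch under stated assumptions: each list entry starts at a line beginning with "- " (after optional indentation), runs until the next such line or the next unindented key, and carries a plain, unquoted host_name value. The function name drop_host_alive_entry is hypothetical.

```shell
# drop_host_alive_entry HOST: filter a host_alive.yaml read from stdin,
# omitting every list entry whose host_name is HOST. All other lines are
# passed through unchanged.
drop_host_alive_entry() {
    awk -v host="$1" '
        # Print the buffered entry unless it was marked for removal.
        function emit() { if (!drop) printf "%s", buf; buf = ""; drop = 0 }
        /^ *- / || /^[^ -]/ { emit() }
        {
            buf = buf $0 "\n"
            line = $0
            sub(/^[ -]*/, "", line)
            if (line == "host_name: " host) drop = 1
        }
        END { emit() }'
}

# Usage (keeps a backup and writes the filtered result back):
#   f=/etc/monasca/agent/conf.d/host_alive.yaml
#   sudo cp "$f" "$f.bak"
#   drop_host_alive_entry MCP-VCP-cpesx-esx-ovsvapp0001-mgmt < "$f.bak" | sudo tee "$f" >/dev/null
```

Comparing the backup and the result with diff before restarting the monasca-agent is a cheap way to confirm that only the intended entry was removed.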

7.4.5 Clean up the entries of OVSvAPP VM from /etc/hosts

Perform the following procedure to clean up the entries of OVSvAPP VM from /etc/hosts.

  1. Login to Cloud Lifecycle Manager.

  2. Edit /etc/hosts. This requires sudo access.

    ardana > sudo vi /etc/hosts

    For example, the MCP-VCP-cpesx-esx-ovsvapp0001-mgmt VM is present in /etc/hosts:

    192.168.86.17    MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
  3. Delete the OVSvAPP entries from /etc/hosts.
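The deletion in step 3 can be sketched as a filter that drops every line naming the removed VM. This is an illustration only; the function name strip_hosts_entry is hypothetical, and the sketch assumes the usual hosts file layout of an IP address followed by one or more names.

```shell
# strip_hosts_entry NAME: filter an /etc/hosts-style file read from stdin,
# omitting every non-comment line that lists NAME as a hostname or alias.
strip_hosts_entry() {
    awk -v name="$1" '
        /^[ \t]*#/ { print; next }
        {
            for (i = 2; i <= NF; i++) {
                if (substr($i, 1, 1) == "#") break   # rest of line is a comment
                if ($i == name) next                 # drop this line
            }
            print
        }'
}

# Usage (keeps a backup and writes the filtered result back):
#   sudo cp /etc/hosts /etc/hosts.bak
#   strip_hosts_entry MCP-VCP-cpesx-esx-ovsvapp0001-mgmt < /etc/hosts.bak | sudo tee /etc/hosts >/dev/null
```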

7.4.6 Remove the OVSVAPP VM from the servers.yml and pass_through.yml files and run the Configuration Processor

Complete these steps from the Cloud Lifecycle Manager to remove the OVSvAPP VM:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the servers.yml file to remove references to the OVSvAPP VM(s) you want to remove:

    ~/openstack/my_cloud/definition/data/servers.yml

    For example:

    - ip-addr: 192.168.86.17
      server-group: AZ1
      role: OVSVAPP-ROLE
      id: 6afaa903398c8fc6425e4d066edf4da1a0f04388
  3. Edit the ~/openstack/my_cloud/definition/data/pass_through.yml file to remove the OVSvAPP VM references. Use the server id from the preceding step to find the references.

    - data:
        vmware:
          vcenter_cluster: Clust1
          cluster_dvs_mapping: 'DC1/host/Clust1:TRUNK-DVS-Clust1'
          esx_hostname: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
          vcenter_id: 0997E2ED9-5E4F-49EA-97E6-E2706345BAB2
      id: 6afaa903398c8fc6425e4d066edf4da1a0f04388
  4. Commit the changes to git:

    ardana > git commit -a -m "Remove ESXi host <name>"
  5. Run the configuration processor. You may want to use the remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data” for more details.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
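Before committing in step 4, it can be useful to confirm that neither the server id nor the hostname is still referenced anywhere in the cloud definition. The following is a minimal sketch; the function name leftover_refs is hypothetical.

```shell
# leftover_refs DIR PATTERN...: print file:line:text for every occurrence of
# any PATTERN (matched as a fixed string) in the files under DIR.
# Returns 0 when nothing is left, 1 when at least one reference remains.
leftover_refs() {
    dir=$1; shift
    found=0
    for pat in "$@"; do
        if grep -rnF -- "$pat" "$dir"; then
            found=1
        fi
    done
    return "$found"
}

# Usage:
#   leftover_refs ~/openstack/my_cloud/definition/data \
#       6afaa903398c8fc6425e4d066edf4da1a0f04388 MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
```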

7.4.7 Clean Up nova Agent for ESX Proxy

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the credentials.

    ardana > source service.osrc
  3. Find the nova ID for ESX Proxy with openstack compute service list.

  4. Delete the ESX Proxy service.

    ardana > openstack compute service delete ESX_PROXY_ID

If you have more than one host, perform the preceding procedure for all the hosts.
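Step 3 can be narrowed down with the output formatting options of the openstack client (`-f value -c ID -c Host`), which print one "ID Host" pair per line. The following is a minimal sketch that assumes the ESX Proxy is registered as a nova-compute service; the function name compute_service_ids and the placeholder ESX_PROXY_HOSTNAME are illustrative.

```shell
# compute_service_ids HOST: read "ID Host" pairs (one per line) on stdin and
# print the ID of each pair whose host equals HOST exactly.
compute_service_ids() {
    awk -v host="$1" '$2 == host { print $1 }'
}

# Usage:
#   openstack compute service list --service nova-compute -f value -c ID -c Host |
#       compute_service_ids ESX_PROXY_HOSTNAME
```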

7.4.8 Clean Up monasca Agent for ESX Proxy

  1. Using the ESX proxy hostname, execute the following command to list all alarms.

    ardana > monasca alarm-list --metric-dimensions hostname=COMPUTE_NODE_DELETED

    where COMPUTE_NODE_DELETED is the hostname taken from the vSphere client (refer to Section 7.3.1, “Prerequisites”).

    Note

    Make a note of all the alarm IDs that are displayed after executing the preceding command.

  2. Delete the ESX Proxy alarm using the alarm IDs.

    monasca alarm-delete <alarm ID>

    This step has to be performed for all alarm IDs listed with the monasca alarm-list command.

7.4.9 Clean Up ESX Proxy Entries in /etc/hosts

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the /etc/hosts file, removing ESX Proxy entries.

7.4.10 Remove ESX Proxy from servers.yml and pass_through.yml files; run the Configuration Processor

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the servers.yml file to remove references to the ESX Proxy:

    ~/openstack/my_cloud/definition/data/servers.yml
  3. Edit the ~/openstack/my_cloud/definition/data/pass_through.yml file to remove the ESX Proxy references, using the server id from the servers.yml file.

  4. Commit the changes to git:

    git commit -a -m "Remove ESX Proxy references"
  5. Run the configuration processor. You may want to use the remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data” for more details.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml \
    -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

7.4.11 Remove Distributed Resource Scheduler (DRS) Rules

Perform the following procedure to remove the DRS rules, which are added by the OVSvAPP installer to ensure that the OVSvAPP VM does not get migrated to other hosts.

  1. Log in to vCenter.

  2. Right-click the cluster and select Edit settings.

    A cluster settings page appears.

  3. Click DRS Groups Manager on the left-hand side of the pop-up box. Select the group that was created for the deleted OVSvAPP and click Remove.

  4. Click Rules on the left-hand side of the pop-up box, select the checkbox for the deleted OVSvAPP, and click Remove.

  5. Click OK.

7.5 Configuring Debug Logging

7.5.1 To Modify the OVSVAPP VM Log Level

To change the OVSVAPP log level to DEBUG, do the following:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the file below:

    ~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
  3. Set the logging level value of the logger_root section to DEBUG, like this:

    [logger_root]
    qualname: root
    handlers: watchedfile, logstash
    level: DEBUG
  4. Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Deploy your changes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
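To double-check the edit from step 3 before committing, the level of the logger_root section can be read back from the template. The following is a minimal sketch; the function name section_value is hypothetical, and it assumes section headers are written on their own line as [name].

```shell
# section_value SECTION KEY: read an INI-style file on stdin and print the
# value of KEY inside [SECTION]. Both "key: value" and "key = value" forms
# are accepted.
section_value() {
    awk -v section="$1" -v key="$2" '
        /^\[/ { in_section = ($0 == "[" section "]"); next }
        in_section {
            line = $0
            sub(/^[ \t]+/, "", line)
            if (index(line, key) != 1) next          # line does not start with KEY
            rest = substr(line, length(key) + 1)
            if (rest !~ /^[ \t]*[:=]/) next          # KEY is only a prefix of a longer name
            sub(/^[ \t]*[:=][ \t]*/, "", rest)
            print rest
        }'
}

# Usage:
#   section_value logger_root level < \
#       ~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
```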

7.5.2 To Enable OVSVAPP Service for Centralized Logging

To enable OVSVAPP Service for centralized logging:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the file below:

    ~/openstack/my_cloud/config/logging/vars/neutron-ovsvapp-clr.yml
  3. Set the value of enabled under centralized_logging to true, as shown in the following sample:

    logr_services:
      neutron-ovsvapp:
        logging_options:
        - centralized_logging:
            enabled: true
            format: json
            ...
  4. Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Deploy your changes, specifying the hostname for your OVSvAPP host:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml --limit <hostname>

    The hostname of the node can be found in the list generated from the output of the following command:

    grep hostname ~/openstack/my_cloud/info/server_info.yml

7.6 Making Scale Configuration Changes

This procedure describes how to make the recommended configuration changes to achieve 8,000 virtual machine instances.

Note

In a scale environment for ESX computes, the configuration of vCenter Proxy VM has to be increased to 8 vCPUs and 16 GB RAM. By default it is 4 vCPUs and 4 GB RAM.

  1. Change to the directory that contains the nova.conf.j2 file:

    cd ~/openstack/ardana/ansible/roles/nova-common/templates
  2. Edit the DEFAULT section in the nova.conf.j2 file as below:

    [DEFAULT]
    rpc_response_timeout = 180
    service_down_time = 300
    report_interval = 30
  3. Commit your configuration:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "<commit message>"
  4. Prepare your environment for deployment:

    ansible-playbook -i hosts/localhost ready-deployment.yml
    cd ~/scratch/ansible/next/ardana/ansible
  5. Execute the nova-reconfigure playbook:

    ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
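As a rough sanity check on the values in step 2: assuming nova treats a service as down once no status report has arrived within the down-time window (300 seconds here), the ratio of that window to report_interval (30 seconds) is the number of reports that fit into the window. A minimal sketch, where missed_reports is a hypothetical helper name:

```shell
# missed_reports DOWN_TIME REPORT_INTERVAL: print how many whole report
# intervals fit into the down-time window (integer division).
missed_reports() {
    echo $(( $1 / $2 ))
}

# Usage with the values from step 2:
#   missed_reports 300 30
```

With the values above, the result is 10, so a service has to miss about ten consecutive reports before it is considered down.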

7.7 Monitoring vCenter Clusters

Remote monitoring of activated ESX clusters is enabled through the vCenter plugin of monasca. The monasca-agent running on each ESX Compute proxy node is configured with the vcenter plugin to monitor the cluster.

Alarm definitions are created with default threshold values. Whenever a threshold is breached, the respective alarms are generated with the corresponding state (OK, ALARM, or UNDETERMINED).

The configuration file details are given below:

init_config: {}
instances:
  - vcenter_ip: <vcenter-ip>
    username: <vcenter-username>
    password: <vcenter-password>
    clusters: <[cluster list]>

The vCenter plugin posts the following metrics to monasca:

  • vcenter.cpu.total_mhz

  • vcenter.cpu.used_mhz

  • vcenter.cpu.used_perc

  • vcenter.cpu.total_logical_cores

  • vcenter.mem.total_mb

  • vcenter.mem.used_mb

  • vcenter.mem.used_perc

  • vcenter.disk.total_space_mb

  • vcenter.disk.total_used_space_mb

  • vcenter.disk.total_used_space_perc

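For example, the following command lists the measurements of the vcenter.disk.total_used_space_mb metric for one cluster, starting from the given UTC time:

monasca measurement-list --dimensions esx_cluster_id=domain-c7.D99502A9-63A8-41A2-B3C3-D8E31B591224 vcenter.disk.total_used_space_mb 2016-08-30T11:20:08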

+----------------------------------------------+----------------------------------------------------------------------------------------------+-----------------------------------+------------------+-----------------+
| name                                         | dimensions                                                                                   | timestamp                         | value            | value_meta      |
+----------------------------------------------+----------------------------------------------------------------------------------------------+-----------------------------------+------------------+-----------------+
| vcenter.disk.total_used_space_mb             | vcenter_ip: 10.1.200.91                                                                      | 2016-08-30T11:20:20.703Z          | 100371.000       |                 |
|                                              | esx_cluster_id: domain-c7.D99502A9-63A8-41A2-B3C3-D8E31B591224                               | 2016-08-30T11:20:50.727Z          | 100371.000       |                 |
|                                              | hostname: MCP-VCP-cpesx-esx-comp0001-mgmt                                                    | 2016-08-30T11:21:20.707Z          | 100371.000       |                 |
|                                              |                                                                                              | 2016-08-30T11:21:50.700Z          | 100371.000       |                 |
|                                              |                                                                                              | 2016-08-30T11:22:20.700Z          | 100371.000       |                 |
|                                              |                                                                                              | 2016-08-30T11:22:50.700Z          | 100371.000       |                 |
|                                              |                                                                                              | 2016-08-30T11:23:20.620Z          | 100371.000       |                 |
+----------------------------------------------+-----------------------------------------------------------------------------------------------+-----------------------------------+------------------+-----------------+

Dimensions

Each metric has the following dimensions:

vcenter_ip

FQDN/IP Address of the registered vCenter server

esx_cluster_id

clusterName.vCenter-id, as seen in the openstack hypervisor list

hostname

ESX compute proxy name

Alarms

Alarms are created to monitor CPU, memory, and disk usage for each activated cluster. The alarm definitions are as follows:

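+--------------------------+-----------------------------------------+----------+----------------+
| Name                     | Expression                              | Severity | Match_by       |
+--------------------------+-----------------------------------------+----------+----------------+
| ESX cluster CPU Usage    | avg(vcenter.cpu.used_perc) > 90 times 3 | High     | esx_cluster_id |
| ESX cluster Memory Usage | avg(vcenter.mem.used_perc) > 90 times 3 | High     | esx_cluster_id |
| ESX cluster Disk Usage   | vcenter.disk.total_used_space_perc > 90 | High     | esx_cluster_id |
+--------------------------+-----------------------------------------+----------+----------------+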
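
To illustrate how an expression such as avg(vcenter.cpu.used_perc) > 90 times 3 behaves, the following Python sketch assumes that "times 3" means the threshold must be exceeded in three consecutive evaluation periods. The function name, the sample values, and the handling of missing data are hypothetical simplifications; the real evaluation is performed by monasca itself.

```python
def alarm_state(period_averages, threshold=90.0, times=3):
    """Return a simplified alarm state for a list of per-period averages.

    Assumption: the alarm fires only when the last `times` periods
    all exceed the threshold.
    """
    if len(period_averages) < times:
        # Not enough data to evaluate the expression (simplification).
        return "UNDETERMINED"
    recent = period_averages[-times:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Three consecutive periods above 90 percent trigger the alarm.
print(alarm_state([85.0, 92.5, 95.1, 91.0]))
# A single period below the threshold keeps the state at OK.
print(alarm_state([92.5, 95.1, 88.0]))
```

With the sample values above, the first call reports ALARM and the second reports OK.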

7.8 Monitoring Integration with OVSvApp Appliance

7.8.1 Processes Monitored with monasca-agent

Using the monasca agent, the following services are monitored on the OVSvApp appliance:

  • neutron_ovsvapp_agent service - The neutron agent that runs in the appliance and enables networking for the tenant virtual machines.

  • Openvswitch - This service is used by the neutron_ovsvapp_agent service for enabling the datapath and security for the tenant virtual machines.

  • Ovsdb-server - This service is used by the neutron_ovsvapp_agent service.

If any of these three processes fails on the OVSvApp appliance, networking for the tenant virtual machines is disrupted, which is why they are monitored.

The monasca-agent periodically reports the status of these processes and metrics data (for example, 'load' - cpu.load_avg_1min, 'process' - process.pid_count, 'memory' - mem.usable_perc, 'disk' - disk.space_used_perc, 'cpu' - cpu.idle_perc) to the monasca server.

7.8.2 How It Works

Once the vApp is configured and up, the monasca-agent will attempt to register with the monasca server. After successful registration, the monitoring begins on the processes listed above and you will be able to see status updates on the server side.

The monasca-agent monitors the processes at the system level so, in the case of failures of any of the configured processes, updates should be seen immediately from monasca.

To check the events from the server side, log into the Operations Console.

8 Managing Block Storage

Information about managing and configuring the Block Storage service.

8.1 Managing Block Storage using Cinder

SUSE OpenStack Cloud Block Storage volume operations use the OpenStack cinder service to manage storage volumes, which includes creating volumes, attaching/detaching volumes to nova instances, creating volume snapshots, and configuring volumes.

SUSE OpenStack Cloud supports the following storage back ends for block storage volumes and backup datastore configuration:

  • Volumes

    • SUSE Enterprise Storage; for more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.3 “SUSE Enterprise Storage Integration”.

    • 3PAR FC or iSCSI; for more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”.

  • Backup

    • swift

8.1.1 Setting Up Multiple Block Storage Back-ends

SUSE OpenStack Cloud supports setting up multiple block storage backends and multiple volume types.

Whether you have a single or multiple block storage back-ends defined in your cinder.conf.j2 file, you can create one or more volume types using the specific attributes associated with the back-end. You can find details on how to do that for each of the supported back-end types here:

  • Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.3 “SUSE Enterprise Storage Integration”

  • Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”

8.1.2 Creating a Volume Type for your Volumes

Creating volume types allows you to create standard specifications for your volumes.

Volume types are used to specify a standard Block Storage back-end and collection of extra specifications for your volumes. This allows an administrator to give users a variety of options while simplifying the process of creating volumes.

The tasks involved in this process are described in the following sections.

8.1.2.1 Create a Volume Type for your Volumes

The default volume type will be thin provisioned and will have no fault tolerance (RAID 0). You should configure cinder to fully provision volumes, and you may want to configure fault tolerance. Follow the instructions below to create a new volume type that is fully provisioned and fault tolerant:

Perform the following steps to create a volume type using the horizon GUI:

  1. Log in to the horizon dashboard.

  2. Ensure that you are scoped to your admin Project. Then under the Admin menu in the navigation pane, click on Volumes under the System subheading.

  3. Select the Volume Types tab and then click the Create Volume Type button to display a dialog box.

  4. Enter a unique name for the volume type and then click the Create Volume Type button to complete the action.

The newly created volume type will be displayed in the Volume Types list confirming its creation.

Important

You must set a default_volume_type in cinder.conf.j2, whether it is default_type or one you have created. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”, Section 35.1.4 “Configure 3PAR FC as a Cinder Backend”.

8.1.2.2 Associate the Volume Type to the Back-end

After the volume type(s) have been created, you can assign extra specification attributes to the volume types. Each Block Storage back-end option has unique attributes that can be used.

To map a volume type to a back-end, do the following:

  1. Log in to the horizon dashboard.

  2. Ensure that you are scoped to your admin Project (for more information, see Section 5.10.7, “Scope Federated User to Domain”). Then under the Admin menu in the navigation pane, click on Volumes under the System subheading.

  3. Click the Volume Types tab to list the volume types.

  4. In the Actions column of the Volume Type you created earlier, click the drop-down option and select View Extra Specs which will bring up the Volume Type Extra Specs options.

  5. Click the Create button on the Volume Type Extra Specs screen.

  6. In the Key field, enter one of the key values in the table in the next section. In the Value box, enter its corresponding value. Once you have completed that, click the Create button to create the extra volume type specs.

Once the volume type is mapped to a back-end, you can create volumes with this volume type.

8.1.2.3 Extra Specification Options for 3PAR

3PAR supports volume creation with additional attributes. These attributes can be specified using the extra specs options for your volume type. The administrator is expected to define appropriate extra specs for the 3PAR volume type as per the guidelines provided at http://docs.openstack.org/liberty/config-reference/content/hp-3par-supported-ops.html.

The following cinder Volume Type extra-specs options enable control over the 3PAR storage provisioning type:

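+--------------------------------+----------------------+-----------------------------------------------+
| Key                            | Value                | Description                                   |
+--------------------------------+----------------------+-----------------------------------------------+
| volume_backend_name            | volume backend name  | The name of the back-end to which you want to |
|                                |                      | associate the volume type, which you also     |
|                                |                      | specified earlier in the cinder.conf.j2 file. |
| hp3par:provisioning (optional) | thin, full, or dedup |                                               |
+--------------------------------+----------------------+-----------------------------------------------+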

For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”.
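
The key/value pairs above can be sketched as a small Python helper that assembles the extra specs for a volume type and rejects provisioning values other than the three listed. This is only an illustration of the data you enter in the Key and Value fields: the function name and the back-end name 3par_FC are hypothetical, and the helper does not talk to cinder.

```python
# Provisioning values listed in the table above.
ALLOWED_PROVISIONING = {"thin", "full", "dedup"}


def build_extra_specs(backend_name, provisioning=None):
    """Return the extra specs dictionary for a 3PAR volume type."""
    if not backend_name:
        raise ValueError("volume_backend_name is required")
    specs = {"volume_backend_name": backend_name}
    if provisioning is not None:
        # hp3par:provisioning is optional, but must be a listed value.
        if provisioning not in ALLOWED_PROVISIONING:
            raise ValueError("unsupported provisioning type: %s" % provisioning)
        specs["hp3par:provisioning"] = provisioning
    return specs


# "3par_FC" is a made-up back-end name used only for this example.
print(build_extra_specs("3par_FC", "full"))
```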

8.1.3 Managing cinder Volume and Backup Services

Important: Use Only When Needed

If the host running the cinder-volume service fails for any reason, it should be restarted as quickly as possible. Often, the host running cinder services also runs high availability (HA) services such as MariaDB and RabbitMQ. These HA services are at risk while one of the nodes in the cluster is down. If it will take a significant amount of time to recover the failed node, you may migrate the cinder-volume service and its backup service to one of the other controller nodes. When the node has been recovered, you should migrate the cinder-volume service and its backup service back to the original (default) node.

The cinder-volume service and its backup service migrate as a pair. If you migrate the cinder-volume service, its backup service will also be migrated.

8.1.3.1 Migrating the cinder-volume service

The following steps will migrate the cinder-volume service and its backup service.

  1. Log in to the Cloud Lifecycle Manager node.

  2. Determine the host index numbers for each of your control plane nodes. This host index number will be used in a later step. They can be obtained by running this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts cinder-show-volume-hosts.yml

    Here is an example snippet showing the output for a single three-node control plane. The host index number is the first element of each item and is repeated in the msg field:

    TASK: [_CND-CMN | show_volume_hosts | Show cinder Volume hosts index and hostname] ***
    ok: [ardana-cp1-c1-m1] => (item=(0, 'ardana-cp1-c1-m1')) => {
        "item": [
            0,
            "ardana-cp1-c1-m1"
        ],
        "msg": "Index 0 Hostname ardana-cp1-c1-m1"
    }
    ok: [ardana-cp1-c1-m1] => (item=(1, 'ardana-cp1-c1-m2')) => {
        "item": [
            1,
            "ardana-cp1-c1-m2"
        ],
        "msg": "Index 1 Hostname ardana-cp1-c1-m2"
    }
    ok: [ardana-cp1-c1-m1] => (item=(2, 'ardana-cp1-c1-m3')) => {
        "item": [
            2,
            "ardana-cp1-c1-m3"
        ],
        "msg": "Index 2 Hostname ardana-cp1-c1-m3"
    }
  3. Locate the control plane fact file for the control plane you need to migrate the service from. It will be located in the following directory:

    /etc/ansible/facts.d/

    These fact files use the following naming convention:

    cinder_volume_run_location_<control_plane_name>.fact
  4. Edit the fact file to include the host index number of the control plane node you wish to migrate the cinder-volume services to. For example, if they currently reside on your first controller node, host index 0, and you wish to migrate them to your second controller, you would change the value in the fact file to 1.

  5. If you are using data encryption on your Cloud Lifecycle Manager, ensure you have included the encryption key in your environment variables. For more information see Book “Security Guide”, Chapter 10 “Encryption of Passwords and Sensitive Data”.

    export HOS_USER_PASSWORD_ENCRYPT_KEY=<encryption key>
  6. After you have edited the control plane fact file, run the cinder volume migration playbook for the control plane nodes involved in the migration. At minimum this includes the one to start cinder-volume manager on and the one on which to stop it:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts cinder-migrate-volume.yml --limit=<limit_pattern1,limit_pattern2>
    Note

    <limit_pattern> is the pattern used to limit the hosts that are selected to those within a specific control plane. For example, with the nodes in the snippet shown above, --limit=ardana-cp1-c1-m1,ardana-cp1-c1-m2

  7. If the playbook summary reports no errors, you may disregard informational messages such as:

    msg: Marking ardana_notify_cinder_restart_required to be cleared from the fact cache
  8. Once your maintenance or other tasks are completed, migrate the cinder-volume services back to their original node using these same steps.
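
The host index lookup in step 2 can be sketched in Python as follows. The sketch assumes that the playbook output has been captured as text and that each host appears in a "msg" line of the form shown in the example; the function name is hypothetical and not part of the product.

```python
import re

# "msg" lines copied from the example playbook output in step 2.
SAMPLE_OUTPUT = '''
    "msg": "Index 0 Hostname ardana-cp1-c1-m1"
    "msg": "Index 1 Hostname ardana-cp1-c1-m2"
    "msg": "Index 2 Hostname ardana-cp1-c1-m3"
'''


def host_indexes(playbook_output):
    """Map each host name to its host index number."""
    pattern = re.compile(r'"msg": "Index (\d+) Hostname ([^"]+)"')
    return {host: int(index) for index, host in pattern.findall(playbook_output)}


indexes = host_indexes(SAMPLE_OUTPUT)
# The value to write into the fact file when migrating to the second controller.
print(indexes["ardana-cp1-c1-m2"])
```

For the example output, the index of ardana-cp1-c1-m2 is 1, which matches the value used in step 4.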

9 Managing Object Storage

Information about managing and configuring the Object Storage service.

The Object Storage service may be deployed in a full-fledged manner, with proxy nodes engaging rings for managing the accounts, containers, and objects being stored. Or, it may simply be deployed as a front-end to SUSE Enterprise Storage, offering Object Storage APIs with an external back-end.

In the former case, managing your Object Storage environment includes tasks such as ensuring that your swift rings stay balanced. These and other topics are discussed in more detail in this section. swift includes many commands and utilities for these purposes.

When used as a front-end to SUSE Enterprise Storage, many swift constructs such as rings and ring balancing, replica dispersion, etc. do not apply, as swift itself is not responsible for the mechanics of object storage.

9.1 Running the swift Dispersion Report

swift contains a tool called swift-dispersion-report that can be used to determine whether your containers and objects have the three replicas they are supposed to have. This tool works by populating a percentage of partitions in the system with containers and objects (using swift-dispersion-populate) and then running the report to see if all the replicas of these containers and objects are in the correct place. For a more detailed explanation of this tool in OpenStack swift, see OpenStack swift - Administrator's Guide.

9.1.1 Configuring the swift dispersion populate

Once a swift system has been fully deployed in SUSE OpenStack Cloud 9, you can set up the swift-dispersion-report using the default parameters found in ~/openstack/ardana/ansible/roles/swift-dispersion/templates/dispersion.conf.j2. By default, this populates 1% of the partitions on the system. If you are happy with this figure, proceed to step 2 below. Otherwise, follow step 1 to edit the configuration file.

  1. If you wish to change the dispersion coverage percentage, then connect to the Cloud Lifecycle Manager server and change the value of dispersion_coverage in the ~/openstack/ardana/ansible/roles/swift-dispersion/templates/dispersion.conf.j2 file to the value you wish to use. In the example below we have altered the file to create 5% dispersion:

    ...
    [dispersion]
    auth_url = {{ keystone_identity_uri }}/v3
    auth_user = {{ swift_dispersion_tenant }}:{{ swift_dispersion_user }}
    auth_key = {{ swift_dispersion_password  }}
    endpoint_type = {{ endpoint_type }}
    auth_version = {{ disp_auth_version }}
    # Set this to the percentage coverage. We recommend a value
    # of 1%. You can increase this to get more coverage. However, if you
    # decrease the value, the dispersion containers and objects are
    # not deleted.
    dispersion_coverage = 5.0
  2. Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  3. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  4. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  5. Reconfigure the swift servers:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  6. Run this playbook to populate your swift system for the health check:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-dispersion-populate.yml

9.1.2 Running the swift dispersion report

Check the status of the swift system by running the swift dispersion report with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-dispersion-report.yml

The output of the report will look similar to this:

TASK: [swift-dispersion | report | Display dispersion report results] *********
ok: [padawan-ccp-c1-m1-mgmt] => {
    "var": {
        "dispersion_report_result.stdout_lines": [
            "Using storage policy: General ",
            "",
            "Queried 40 containers for dispersion reporting, 0s, 0 retries",
            "100.00% of container copies found (120 of 120)",
            "Sample represents 0.98% of the container partition space",
            "",
            "Queried 40 objects for dispersion reporting, 0s, 0 retries",
            "There were 40 partitions missing 0 copies.",
            "100.00% of object copies found (120 of 120)",
            "Sample represents 0.98% of the object partition space"
        ]
    }
}
...
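
The figures in the example report can be reproduced with a short calculation. The Python sketch below assumes a ring with 2^12 (4096) partitions and three replicas; the partition count is not stated in the report and is only inferred here from the 0.98% figure, and the function name is hypothetical.

```python
def dispersion_sample(queried, partition_power, replicas=3):
    """Return (percent of partition space sampled, expected number of copies)."""
    total_partitions = 2 ** partition_power
    sample_percent = 100.0 * queried / total_partitions
    expected_copies = queried * replicas
    return sample_percent, expected_copies


# 40 containers (or objects) were queried in the example report.
percent, copies = dispersion_sample(queried=40, partition_power=12)
print(f"{percent:.2f}% of partition space, {copies} copies expected")
```

Under these assumptions, 40 sampled partitions correspond to 0.98% of the partition space and 120 expected copies, which matches the "120 of 120" lines in the report.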

In addition to running the report manually as shown above, a cron job on the primary proxy node of your cloud environment runs dispersion-report every 2 hours and saves the results to the following location on its local filesystem:

/var/cache/swift/dispersion-report

When interpreting the results of this report, we recommend consulting the Cluster Health section of the swift Administrator's Guide.

9.2 Gathering Swift Data

The swift-recon command retrieves data from swift servers and displays the results. To use this command, log on as the root user to any node that is running the swift-proxy service.

9.2.1 Notes

For help with the swift-recon command you can use this:

tux > sudo swift-recon --help
Warning

The --driveaudit option is not supported.

Warning

SUSE OpenStack Cloud does not support ec_type isa_l_rs_vand and ec_num_parity_fragments greater than or equal to 5 in the storage-policy configuration. This particular policy is known to harm data durability.

9.2.2 Using the swift-recon Command

The following command retrieves and displays disk usage information:

tux > sudo swift-recon --diskusage

For example:

tux > sudo swift-recon --diskusage
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-09-14 16:01:40] Checking disk usage now
Distribution Graph:
 10%    3 *********************************************************************
 11%    1 ***********************
 12%    2 **********************************************
Disk usage: space used: 13745373184 of 119927734272
Disk usage: space free: 106182361088 of 119927734272
Disk usage: lowest: 10.39%, highest: 12.96%, avg: 11.4613798613%
===============================================================================

In the above example, the results for several nodes are combined together. You can also view the results from individual nodes by adding the -v option as shown in the following example:

tux > sudo swift-recon --diskusage -v
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-09-14 16:12:30] Checking disk usage now
-> http://192.168.245.3:6000/recon/diskusage: [{'device': 'disk1', 'avail': 17398411264, 'mounted': True, 'used': 2589544448, 'size': 19987955712}, {'device': 'disk0', 'avail': 17904222208, 'mounted': True, 'used': 2083733504, 'size': 19987955712}]
-> http://192.168.245.2:6000/recon/diskusage: [{'device': 'disk1', 'avail': 17769721856, 'mounted': True, 'used': 2218233856, 'size': 19987955712}, {'device': 'disk0', 'avail': 17793581056, 'mounted': True, 'used': 2194374656, 'size': 19987955712}]
-> http://192.168.245.4:6000/recon/diskusage: [{'device': 'disk1', 'avail': 17912147968, 'mounted': True, 'used': 2075807744, 'size': 19987955712}, {'device': 'disk0', 'avail': 17404235776, 'mounted': True, 'used': 2583719936, 'size': 19987955712}]
Distribution Graph:
 10%    3 *********************************************************************
 11%    1 ***********************
 12%    2 **********************************************
Disk usage: space used: 13745414144 of 119927734272
Disk usage: space free: 106182320128 of 119927734272
Disk usage: lowest: 10.39%, highest: 12.96%, avg: 11.4614140152%
===============================================================================
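
The summary lines at the end of the report can be derived from the per-device records printed by the -v option. The following Python sketch recomputes them from the values in the example above (only the fields needed for the calculation are kept). It illustrates the arithmetic and is not the swift-recon implementation; the function name is hypothetical.

```python
# Per-device records copied from the -v example output above.
devices = [
    {"host": "192.168.245.3", "device": "disk1", "used": 2589544448, "size": 19987955712},
    {"host": "192.168.245.3", "device": "disk0", "used": 2083733504, "size": 19987955712},
    {"host": "192.168.245.2", "device": "disk1", "used": 2218233856, "size": 19987955712},
    {"host": "192.168.245.2", "device": "disk0", "used": 2194374656, "size": 19987955712},
    {"host": "192.168.245.4", "device": "disk1", "used": 2075807744, "size": 19987955712},
    {"host": "192.168.245.4", "device": "disk0", "used": 2583719936, "size": 19987955712},
]


def summarize(records):
    """Compute totals, lowest/highest usage and a whole-percent histogram."""
    percents = [100.0 * r["used"] / r["size"] for r in records]
    histogram = {}
    for percent in percents:
        bucket = int(percent)  # 10.98 falls into the 10% bucket
        histogram[bucket] = histogram.get(bucket, 0) + 1
    total_used = sum(r["used"] for r in records)
    total_size = sum(r["size"] for r in records)
    return {
        "used": total_used,
        "size": total_size,
        "lowest": min(percents),
        "highest": max(percents),
        "avg": 100.0 * total_used / total_size,
        "histogram": histogram,
    }


summary = summarize(devices)
print("space used: %d of %d" % (summary["used"], summary["size"]))
print("lowest: %.2f%%, highest: %.2f%%, avg: %.4f%%"
      % (summary["lowest"], summary["highest"], summary["avg"]))
print(sorted(summary["histogram"].items()))
```

The totals, the lowest and highest percentages, and the three histogram buckets (10%, 11%, 12%) agree with the example report.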

By default, swift-recon uses the object-0 ring for information about nodes and drives. For some commands, it is appropriate to specify account, container, or object to indicate the type of ring. For example, to check the checksum of the account ring, use the following:

tux > sudo swift-recon --md5 account
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-09-14 16:17:28] Checking ring md5sums
3/3 hosts matched, 0 error[s] while checking hosts.
===============================================================================
[2015-09-14 16:17:28] Checking swift.conf md5sum
3/3 hosts matched, 0 error[s] while checking hosts.
===============================================================================

9.3 Gathering Swift Monitoring Metrics

The swiftlm-scan command is the mechanism used to gather metrics for the monasca system. These metrics are used to derive alarms. For a list of alarms that can be generated from this data, see Section 18.1.1, “Alarm Resolution Procedures”.

To view the metrics, use the swiftlm-scan command directly. Log on to the swift node as the root user. The following example shows the command and a snippet of the output:

tux > sudo swiftlm-scan --pretty
. . .
  {
    "dimensions": {
      "device": "sdc",
      "hostname": "padawan-ccp-c1-m2-mgmt",
      "service": "object-storage"
    },
    "metric": "swiftlm.swift.drive_audit",
    "timestamp": 1442248083,
    "value": 0,
    "value_meta": {
      "msg": "No errors found on device: sdc"
    }
  },
. . .
Note

To make the JSON output easier to read, use the --pretty option.

The fields are as follows:

metric

Specifies the name of the metric.

dimensions

Provides information about the source or location of the metric. The dimensions differ depending on the metric in question. The following dimensions are used by swiftlm-scan:

  • service: This is always object-storage.

  • component: This identifies the component. For example, swift-object-server indicates that the metric is about the swift-object-server process.

  • hostname: This is the name of the node the metric relates to. This is not necessarily the name of the current node.

  • url: If the metric is associated with a URL, this is the URL.

  • port: If the metric relates to connectivity to a node, this is the port used.

  • device: This is the block device a metric relates to.

value

The value of the metric. For many metrics, this is simply the measured value. However, if value_meta contains a msg field, the value is a status. The following status values are used:

  • 0 - no error

  • 1 - warning

  • 2 - failure

value_meta

Additional information. The msg field is the most useful of this information.
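
The field descriptions above can be sketched as a small Python routine that interprets entries from the swiftlm-scan output. The sketch assumes that the output is a JSON list of entries like the one shown in the snippet; the function name is hypothetical.

```python
import json

# One entry copied from the example output above, wrapped in a list
# (assumption: the complete output is a JSON list of such entries).
SAMPLE = '''[
  {
    "dimensions": {
      "device": "sdc",
      "hostname": "padawan-ccp-c1-m2-mgmt",
      "service": "object-storage"
    },
    "metric": "swiftlm.swift.drive_audit",
    "timestamp": 1442248083,
    "value": 0,
    "value_meta": {
      "msg": "No errors found on device: sdc"
    }
  }
]'''

STATUS_NAMES = {0: "no error", 1: "warning", 2: "failure"}


def describe(entry):
    """Render one metric entry, treating the value as a status if msg is present."""
    meta = entry.get("value_meta") or {}
    if "msg" in meta:
        status = STATUS_NAMES.get(entry["value"], "unknown")
        return "%s: %s (%s)" % (entry["metric"], status, meta["msg"])
    return "%s: %s" % (entry["metric"], entry["value"])


for entry in json.loads(SAMPLE):
    print(describe(entry))
```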

9.3.1 Optional Parameters

You can focus on specific sets of metrics by using one of the following optional parameters:

--replication

Checks replication and health status.

--file-ownership

Checks that swift owns its relevant files and directories.

--drive-audit

Checks for logged events about corrupted sectors (unrecoverable read errors) on drives.

--connectivity

Checks connectivity to various servers used by the swift system, including:

  • Checks that this node can connect to all memcached servers

  • Checks that this node can connect to the keystone service (only applicable if this is a proxy server node)

--swift-services

Checks that the relevant swift processes are running.

--network-interface

Checks NIC speed and reports statistics for each interface.

--check-mounts

Checks that the node has correctly mounted drives used by swift.

--hpssacli

If this server uses a Smart Array Controller, this checks the operation of the controller and disk drives.

9.4 Using the swift Command-line Client (CLI)

OpenStackClient (OSC) is a command-line client for OpenStack with a uniform command structure for OpenStack services. Some swift commands do not have OSC equivalents. The swift utility (or swift CLI) is installed on the Cloud Lifecycle Manager node and also on all other nodes running the swift proxy service. To use this utility on the Cloud Lifecycle Manager, you can use the ~/service.osrc file as a basis and then edit it with the credentials of another user if you need to.

ardana > cp ~/service.osrc ~/swiftuser.osrc

Then you can use your preferred editor to edit swiftuser.osrc so you can authenticate using the OS_USERNAME, OS_PASSWORD, and OS_PROJECT_NAME you wish to use. For example, if you want to use the demo user that is created automatically for you, it would look like this:

unset OS_DOMAIN_NAME
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_VERSION=3
export OS_PROJECT_NAME=demo
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USERNAME=demo
export OS_USER_DOMAIN_NAME=Default
export OS_PASSWORD=<password>
export OS_AUTH_URL=<auth_URL>
export OS_ENDPOINT_TYPE=internalURL
# OpenStackClient uses OS_INTERFACE instead of OS_ENDPOINT_TYPE
export OS_INTERFACE=internal
export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
export OS_COMPUTE_API_VERSION=2

You must use the appropriate password for the demo user and select the correct endpoint for the OS_AUTH_URL value, which should be in the ~/service.osrc file you copied.

You can then examine the following account data using this command:

ardana > openstack object store account show

Example showing an environment with no containers or objects:

ardana > openstack object store account show
        Account: AUTH_205804d000a242d385b8124188284998
     Containers: 0
        Objects: 0
          Bytes: 0
X-Put-Timestamp: 1442249536.31989
     Connection: keep-alive
    X-Timestamp: 1442249536.31989
     X-Trans-Id: tx5493faa15be44efeac2e6-0055f6fb3f
   Content-Type: text/plain; charset=utf-8

Use the following command to create a container:

ardana > openstack container create CONTAINER_NAME

Example, creating a container named documents:

ardana > openstack container create documents

The newly created container appears, but it contains no objects:

ardana > openstack container show documents
         Account: AUTH_205804d000a242d385b8124188284998
       Container: documents
         Objects: 0
           Bytes: 0
        Read ACL:
       Write ACL:
         Sync To:
        Sync Key:
   Accept-Ranges: bytes
X-Storage-Policy: General
      Connection: keep-alive
     X-Timestamp: 1442249637.69486
      X-Trans-Id: tx1f59d5f7750f4ae8a3929-0055f6fbcc
    Content-Type: text/plain; charset=utf-8

Upload a document:

ardana > openstack object create CONTAINER_NAME FILENAME

Example:

ardana > openstack object create documents mydocument
mydocument

List objects in the container:

ardana > openstack object list CONTAINER_NAME

Example using a container called documents:

ardana > openstack object list documents
mydocument
Note

This is a brief introduction to the swift CLI. Use the swift --help command for more information. You can also use the OpenStack CLI, see openstack -h for more information.

9.5 Managing swift Rings

swift rings are a machine-readable description of which disk drives are used by the Object Storage service (for example, a drive is used to store account or object data). Rings also specify the policy for data storage (for example, defining the number of replicas). The rings are automatically built during the initial deployment of your cloud, with the configuration provided during setup of the SUSE OpenStack Cloud Input Model. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 5 “Input Model”.

After successful deployment of your cloud, you may want to change or modify the configuration for swift. For example, you may want to add or remove swift nodes, add additional storage policies, or upgrade the size of the disk drives. For instructions, see Section 9.5.5, “Applying Input Model Changes to Existing Rings” and Section 9.5.6, “Adding a New Swift Storage Policy”.

Note

The process of modifying or adding a configuration is similar to other configuration or topology changes in the cloud. Generally, you make the changes to the input model files at ~/openstack/my_cloud/definition/ on the Cloud Lifecycle Manager and then run Ansible playbooks to reconfigure the system.

Changes to the rings require several phases to complete. Therefore, you may need to run the playbooks several times over several days.

The following topics cover ring management.

9.5.1 Rebalancing Swift Rings

The swift ring building process tries to distribute data evenly among the available disk drives. The data is stored in partitions. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.10 “Understanding Swift Ring Specifications”.) If you, for example, double the number of disk drives in a ring, you need to move 50% of the partitions to the new drives so that all drives contain the same number of partitions (and hence same amount of data). However, it is not possible to move the partitions in a single step. It can take minutes to hours to move partitions from the original drives to their new drives (this process is called the replication process).

If you moved all partitions at once, there would be a period during which swift expects to find partitions on the new drives although the data has not yet replicated there, so swift could not return the data to the user. In the middle of replication, some data has already reached its new location while other data is still in the old location. It is therefore considered best practice to move only one replica at a time. If the replica count is 3, you could first move 16.6% of the partitions and then wait until all data has replicated. Then move another 16.6% of the partitions. Wait again and then finally move the remaining 16.6% of the partitions. For any given object, only one of the replicas is moved at a time.
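
The percentages in this example follow from simple arithmetic, sketched below in Python. The sketch assumes that all drives have equal weight and that the move is split evenly across the replicas; the function name and the drive counts are hypothetical.

```python
def partitions_to_move(old_drives, new_drives, replicas=3):
    """Return (overall fraction to move, fraction moved per replica step)."""
    total_drives = old_drives + new_drives
    # With equal weights, the new drives must end up holding their share.
    overall = new_drives / total_drives
    per_step = overall / replicas
    return overall, per_step


# Doubling a ring from 10 to 20 drives with a replica count of 3.
overall, per_step = partitions_to_move(old_drives=10, new_drives=10)
print(f"{overall:.1%} in total, {per_step:.1%} per step")
```

Doubling the number of drives gives 50% overall and roughly 16.7% per step, which the text above truncates to 16.6%.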

9.5.1.1 Reasons to Move Partitions Gradually

Due to the following factors, you must move the partitions gradually:

  • Not all devices are of the same size. SUSE OpenStack Cloud 9 automatically assigns different weights to drives so that smaller drives store fewer partitions than larger drives.

  • The process attempts to keep replicas of the same partition in different servers.

  • Making a large change in one step (for example, doubling the number of drives in the ring) would result in a lot of network traffic from the replication process, and system performance would suffer. There are two ways to mitigate this:

9.5.2 Using the Weight-Step Attributes to Prepare for Ring Changes

swift rings are built during a deployment and this process sets the weights of disk drives such that smaller disk drives have a smaller weight than larger disk drives. When making changes in the ring, you should limit the amount of change that occurs. SUSE OpenStack Cloud 9 does this by limiting the weights of the new drives to a smaller value and then building new rings. Once the replication process has finished, SUSE OpenStack Cloud 9 will increase the weight and rebuild rings to trigger another round of replication. (For more information, see Section 9.5.1, “Rebalancing Swift Rings”.)

In addition, you should become familiar with how the replication process behaves on your system during normal operation. Before making ring changes, use the swift-recon command to determine the typical oldest replication times for your system. For instructions, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

In SUSE OpenStack Cloud, the weight-step attribute is set in the ring specification of the input model. The weight-step value specifies the maximum change in the weight of a drive in any single rebalance. For example, if you add a drive of 4TB, you would normally assign a weight of 4096. However, if the weight-step attribute is set to 1024, then when you add that drive the weight is initially set to 1024. The next time you rebalance the ring, the weight is set to 2048, and the rebalance after that sets it to 3072. One further rebalance then sets the weight to the final value of 4096.
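The following minimal sketch shows how a weight could progress under a weight-step limit. It assumes that each rebalance raises the weight by at most weight-step until the target is reached, which is how the attribute is defined above. The function is hypothetical and is not the code used by the playbooks.

```python
def weight_schedule(target_weight, weight_step, start_weight=0):
    """List the weight of a drive after each successive rebalance."""
    if weight_step <= 0:
        raise ValueError("weight_step must be positive")
    weights = []
    current = start_weight
    while current < target_weight:
        # Never change the weight by more than weight_step in one rebalance.
        current = min(current + weight_step, target_weight)
        weights.append(current)
    return weights


# A new 4TB drive (target weight 4096) with weight-step set to 1024.
print(weight_schedule(4096, 1024))
```

Under this assumption, the drive reaches its final weight after four rebalances.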

The value of the weight-step attribute depends on the size of the drives, the number of servers being added, and how experienced you are with the replication process. A common starting value is 20% of the size of an individual drive. For example, when adding 4TB drives, a value of 820 would be appropriate. As you gain more experience with your system, you may increase or reduce this value.
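As a rough aid, the 20% starting point and the resulting number of rebalances can be computed as shown below. The helper names are hypothetical, and rounding the value up is an assumption chosen to match the 820 figure in the example above.

```python
import math


def suggested_weight_step(drive_weight, fraction=0.20):
    """Starting weight-step: a fraction of one drive's full weight, rounded up."""
    return math.ceil(drive_weight * fraction)


def rebalances_needed(drive_weight, weight_step):
    """Number of rebalances before a new drive reaches its full weight."""
    return math.ceil(drive_weight / weight_step)


step = suggested_weight_step(4096)   # a 4TB drive has a full weight of 4096
print(step, rebalances_needed(4096, step))
```

A smaller weight-step spreads the replication load over more rebalances, each of which is also subject to the min-part-hours wait.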

9.5.2.1 Setting the weight-step attribute

Perform the following steps to set the weight-step attribute:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file containing the ring-specifications for the account, container, and object rings.

    Add the weight-step attribute to the ring in this format:

    - name: account
      weight-step: WEIGHT_STEP_VALUE
      display-name: Account Ring
      min-part-hours: 16
      ...

    For example, to set weight-step to 820, add the attribute like this:

    - name: account
      weight-step: 820
      display-name: Account Ring
      min-part-hours: 16
      ...
  3. Repeat step 2 for the other rings, if necessary (container, object-0, and so on).

  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Use the playbook to create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. To complete the configuration, use the ansible playbooks documented in Section 9.5.3, “Managing Rings Using swift Playbooks”.

9.5.3 Managing Rings Using swift Playbooks

The following table describes how playbooks relate to ring management.

Run all of these playbooks on the Cloud Lifecycle Manager from the ~/scratch/ansible/next/ardana/ansible directory.

Playbook: swift-update-from-model-rebalance-rings.yml

Description:

There are two steps in this playbook:

  • Make delta

    It processes the input model and compares it against the existing rings. After comparison, it produces a list of differences between the input model and the existing rings. This is called the ring delta. The ring delta covers drives being added, drives being removed, weight changes, and replica count changes.

  • Rebalance

    The ring delta is then converted into a series of commands (such as add) to the swift-ring-builder program. Finally, the rebalance command is issued to the swift-ring-builder program.

Notes:

This playbook performs its actions on the first node running the swift-proxy service. (For more information, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”.) However, it also scans all swift nodes to find the size of disk drives.

If there are no changes in the ring delta, the rebalance command is still executed to rebalance the rings. If min-part-hours has not yet elapsed or if no partitions need to be moved, new rings are not written.

Playbook: swift-compare-model-rings.yml

Description:

There are two steps in this playbook:

  • Make delta

    This is the same as described for swift-update-from-model-rebalance-rings.yml.

  • Report

    This prints a summary of the proposed changes that will be made to the rings (that is, what would happen if you rebalanced).

The playbook reports any issues or problems it finds with the input model.

Notes:

This playbook can be useful to confirm that there are no errors in the input model. It also allows you to check that, when you change the input model, the proposed ring changes are as expected. For example, if you have added a server to the input model, but this playbook reports that no drives are being added, you should determine the cause.

For help with interpreting the information in this report, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

Playbook: swift-deploy.yml

Description:

swift-deploy.yml is responsible for installing software and configuring swift on nodes. As part of installing and configuring, it runs the swift-update-from-model-rebalance-rings.yml and swift-reconfigure.yml playbooks.

Notes:

This playbook is included in the ardana-deploy.yml and site.yml playbooks, so if you run either of those playbooks, the swift-deploy.yml playbook is also run.

Playbook: swift-reconfigure.yml

Description:

swift-reconfigure.yml takes rings that the swift-update-from-model-rebalance-rings.yml playbook has changed and copies those rings to all swift nodes.

Notes:

Every time that you directly use the swift-update-from-model-rebalance-rings.yml playbook, you must copy these rings to the system using the swift-reconfigure.yml playbook. If you forget and run swift-update-from-model-rebalance-rings.yml twice, the process may move two replicas of some partitions at the same time.
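To make the idea of a ring delta more concrete, the following sketch compares two simple mappings of device name to weight. It is a simplified illustration under stated assumptions: the data layout and function name are hypothetical, and the real playbook also handles replica count changes and the weight-step attribute, which are omitted here.

```python
def make_ring_delta(model_devices, ring_devices):
    """Compare {device: weight} mappings taken from the model and the ring."""
    delta = {"add": [], "remove": [], "set_weight": []}
    for device, weight in sorted(model_devices.items()):
        if device not in ring_devices:
            # In the model but not yet in the ring.
            delta["add"].append((device, weight))
        elif ring_devices[device] != weight:
            # Present in both, but the weights differ.
            delta["set_weight"].append((device, ring_devices[device], weight))
    for device in sorted(ring_devices):
        if device not in model_devices:
            # In the ring but no longer in the model.
            delta["remove"].append(device)
    return delta


model = {"server1:disk0": 18.63, "server2:disk0": 18.63}
ring = {"server1:disk0": 8.0, "server3:disk0": 18.63}
print(make_ring_delta(model, ring))
```

An empty delta corresponds to the case described in the notes where the rebalance command still runs but may not write new rings.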

9.5.3.1 Optional Ansible variables related to ring management

The following optional variables may be specified when running the playbooks outlined above. They are specified using the --extra-vars option.

Variable: limit_ring

Description and use: Limit changes to the named ring. Other rings will not be examined or updated. This option may be used with any of the swift playbooks. For example, to only update the object-1 ring, use the following command:

ardana > ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml --extra-vars "limit_ring=object-1"

Variable: drive_detail

Description and use: Used only with the swift-compare-model-rings.yml playbook. The playbook will include details of changes to every drive where the model and existing rings differ. If you omit the drive_detail variable, only summary information is provided. The following shows how to use the drive_detail variable:

ardana > ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"

9.5.3.2 Interpreting the report from the swift-compare-model-rings.yml playbook

The swift-compare-model-rings.yml playbook compares the existing swift rings with the input model and prints a report telling you how the rings and the model differ. Specifically, it will tell you what actions will take place when you next run the swift-update-from-model-rebalance-rings.yml playbook (or a playbook such as ardana-deploy.yml that runs swift-update-from-model-rebalance-rings.yml).

The swift-compare-model-rings.yml playbook makes no changes. It only produces an advisory report.

Here is an example output from the playbook. The report is between "report.stdout_lines" and "PLAY RECAP":

TASK: [swiftlm-ring-supervisor | validate-input-model | Print report] *********
ok: [ardana-cp1-c1-m1-mgmt] => {
    "var": {
        "report.stdout_lines": [
            "Rings:",
            "  ACCOUNT:",
            "    ring exists (minimum time to next rebalance: 8:07:33)",
            "    will remove 1 devices (18.00GB)",
            "    ring will be rebalanced",
            "  CONTAINER:",
            "    ring exists (minimum time to next rebalance: 8:07:35)",
            "    no device changes",
            "    ring will be rebalanced",
            "  OBJECT-0:",
            "    ring exists (minimum time to next rebalance: 8:07:34)",
            "    no device changes",
            "    ring will be rebalanced"
        ]
    }
}
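If you capture the report lines, a short script can extract the waiting time for each ring. The sketch below assumes the line format shown in the sample output above (ring names in upper case followed by a colon, and times in H:MM:SS form). The function name is hypothetical.

```python
import re
from datetime import timedelta

REBALANCE_RE = re.compile(r"minimum time to next rebalance: (\d+):(\d{2}):(\d{2})")


def time_to_next_rebalance(report_lines):
    """Map each ring name to the wait reported before its next rebalance."""
    waits = {}
    ring = None
    for line in report_lines:
        text = line.strip()
        # Lines such as "ACCOUNT:" start the section for one ring.
        if text.endswith(":") and text != "Rings:":
            ring = text[:-1]
        match = REBALANCE_RE.search(text)
        if match and ring is not None:
            hours, minutes, seconds = (int(g) for g in match.groups())
            waits[ring] = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return waits


sample = [
    "Rings:",
    "  ACCOUNT:",
    "    ring exists (minimum time to next rebalance: 8:07:33)",
    "  CONTAINER:",
    "    ring exists (minimum time to next rebalance: 8:07:35)",
]
for name, wait in time_to_next_rebalance(sample).items():
    print(name, wait)
```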

The following describes the report in more detail:

Message | Description

ring exists

The ring already exists on the system.

ring will be created

The ring does not yet exist on the system.

no device changes

The devices in the ring exactly match the input model. There are no servers being added or removed and the weights are appropriate for the size of the drives.

minimum time to next rebalance

If this time is 0:00:00, the ring will be rebalanced the next time you run one of the swift playbooks that update rings.

If the time is non-zero, it means that not enough time has elapsed since the ring was last rebalanced. Even if you run a swift playbook that attempts to change the ring, the ring will not actually rebalance. This time is determined by the min-part-hours attribute.

set-weight ardana-ccp-c1-m1-mgmt:disk0:/dev/sdc 8.00 > 12.00 > 18.63

The weight of disk0 (mounted on /dev/sdc) on server ardana-ccp-c1-m1-mgmt is currently set to 8.0 but should be 18.63 given the size of the drive. However, in this example, we cannot go directly from 8.0 to 18.63 because of the weight-step attribute. Hence, the proposed weight change is from 8.0 to 12.0.

This information is only shown when you use the drive_detail=yes argument when running the playbook.

will change weight on 12 devices (6.00TB)

The weight of 12 devices will be increased. This might happen, for example, if a server had been added in a prior ring update. However, with use of the weight-step attribute, the system gradually increases the weight of these new devices. In this example, the change in weight represents 6TB of total available storage. For example, if your system currently has 100TB of available storage, when the weight of these devices is changed, there will be 106TB of available storage. If your system is 50% utilized, this means that when the ring is rebalanced, up to 3TB of data may be moved by the replication process. This is an estimate. In practice, because only one replica of a given partition is moved in any given rebalance, it may not be possible to move this amount of data in a single ring rebalance.

add: ardana-ccp-c1-m1-mgmt:disk0:/dev/sdc

The disk0 device on the ardana-ccp-c1-m1-mgmt server will be added to the ring. This happens when a server is added to the input model or if a disk model is changed to add additional devices.

This information is only shown when you use the drive_detail=yes argument when running the playbook.

remove: ardana-ccp-c1-m1-mgmt:disk0:/dev/sdc

The device is no longer in the input model and will be removed from the ring. This happens if a server is removed from the model, a disk drive is removed from a disk model or the server is marked for removal using the pass-through feature.

This information is only shown when you use the drive_detail=yes argument when running the playbook.

will add 12 devices (6TB)

There are 12 devices in the input model that have not yet been added to the ring. Usually this is because one or more servers have been added. In this example, this could be one server with 12 drives or two servers, each with 6 drives. The size in the report is the change in total available capacity. When the weight-step attribute is used, this may be a fraction of the total size of the disk drives. In this example, 6TB of capacity is being added. For example, if your system currently has 100TB of available storage, when these devices are added, there will be 106TB of available storage. If your system is 50% utilized, this means that when the ring is rebalanced, up to 3TB of data may be moved by the replication process. This is an estimate. In practice, because only one replica of a given partition is moved in any given rebalance, it may not be possible to move this amount of data in a single ring rebalance.

will remove 12 devices (6TB)

There are 12 devices in rings that no longer appear in the input model. Usually this is because one or more servers have been removed. In this example, this could be one server with 12 drives or two servers, each with 6 drives. The size in the report is the change in total removed capacity. In this example, 6TB of capacity is being removed. For example, if your system currently has 100TB of available storage, when these devices are removed, there will be 94TB of available storage. If your system is 50% utilized, this means that when the ring is rebalanced, approximately 3TB of data must be moved by the replication process.

min-part-hours will be changed

The min-part-hours attribute has been changed in the ring specification in the input model.

replica-count will be changed

The replica-count attribute has been changed in the ring specification in the input model.

ring will be rebalanced

This is always reported. Every time the swift-update-from-model-rebalance-rings.yml playbook is run, it will execute the swift-ring-builder rebalance command. This happens even if there were no input model changes. If the ring is already well balanced, the swift-ring-builder will not rewrite the ring.
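The capacity figures used in the descriptions above (100TB growing to 106TB or shrinking to 94TB, with about 3TB of data moved at 50% utilization) follow from simple arithmetic. The sketch below reproduces it. The function is hypothetical, and the result is only the rough estimate described above, not a prediction of actual replication traffic.

```python
def capacity_change_estimate(current_tb, change_tb, utilization):
    """Return (new capacity in TB, estimated data moved in TB).

    change_tb is positive when capacity is added and negative when it is
    removed. utilization is the used fraction of the system (0.0 to 1.0).
    """
    new_capacity = current_tb + change_tb
    # Estimate: the changed capacity holds data in proportion to utilization.
    data_moved = abs(change_tb) * utilization
    return new_capacity, data_moved


print(capacity_change_estimate(100, 6, 0.5))    # adding 6TB
print(capacity_change_estimate(100, -6, 0.5))   # removing 6TB
```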

9.5.4 Determining When to Rebalance and Deploy a New Ring

Before deploying a new ring, you must be sure the change that has been applied to the last ring is complete (that is, all the partitions are in their correct location). There are four aspects to this:

  • Is the replication system busy?

    You might want to postpone a ring change until after replication has finished. If the replication system is busy repairing a failed drive, a ring change will place additional load on the system. To check that replication has finished, use the swift-recon command with the --replication argument. (For more information, see Section 9.2, “Gathering Swift Data”.) The oldest completion time can indicate that the replication process is very busy. If it is more than 15 or 20 minutes ago, the object replication process is probably still very busy. The following example indicates that the oldest completion was 120 seconds ago, so the replication process is probably not busy:

    root # swift-recon --replication
    ===============================================================================
    --> Starting reconnaissance on 3 hosts
    ===============================================================================
    [2015-10-02 15:31:45] Checking on replication
    [replication_time] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
    Oldest completion was 2015-10-02 15:31:32 (120 seconds ago) by 192.168.245.4:6000.
    Most recent completion was 2015-10-02 15:31:43 (10 seconds ago) by 192.168.245.3:6000.
    ===============================================================================
  • Are there drive or server failures?

    A drive failure does not preclude deploying a new ring. In principle, there should be two copies elsewhere. However, another drive failure in the middle of replication might make data temporarily unavailable. If possible, postpone ring changes until all servers and drives are operating normally.

  • Has min-part-hours elapsed?

    The swift-ring-builder will refuse to build a new ring until the min-part-hours has elapsed since the last time it built rings. You must postpone changes until this time has elapsed.

    You can determine how long you must wait by running the swift-compare-model-rings.yml playbook, which will tell you how long it will be until min-part-hours has elapsed. For more details, see Section 9.5.3, “Managing Rings Using swift Playbooks”.

    You can change the value of min-part-hours. (For instructions, see Section 9.5.7, “Changing min-part-hours in Swift”).

  • Is the swift dispersion report clean?

    Run the swift-dispersion-report.yml playbook (as described in Section 9.1, “Running the swift Dispersion Report”) and examine the results. If the replication process has not yet replicated partitions that were moved to new drives in the last ring rebalance, the dispersion report will indicate that some containers or objects are missing a copy.

    For example:

    There were 462 partitions missing one copy.

    Assuming all servers and disk drives are operational, the reason for the missing partitions is that the replication process has not yet managed to copy a replica into the partitions.

    You should wait an hour and rerun the dispersion report process and examine the report. The number of partitions missing one copy should have reduced. Continue to wait until this reaches zero before making any further ring rebalances.

    Note

    It is normal to see partitions missing one copy if disk drives or servers are down. If all servers and disk drives are mounted, and you did not recently perform a ring rebalance, you should investigate whether there are problems with the replication process. You can use the Operations Console to investigate replication issues.

    Important

    If there are any partitions missing two copies, you must reboot or repair any failed servers and disk drives as soon as possible. Do not shut down any swift nodes in this situation. Assuming a replica count of 3, if you are missing two copies you are in danger of losing the only remaining copy.
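Two of the checks above, the oldest replication completion and the dispersion report, can be combined in a small script. The sketch below is illustrative only. It assumes the output formats shown in the examples above, handles only completion times reported in seconds, and treats anything it cannot parse as "not ready". The function names and the 15-minute threshold parameter are hypothetical.

```python
import re

OLDEST_RE = re.compile(r"Oldest completion was .* \((\d+) seconds ago\)")
MISSING_RE = re.compile(r"There were (\d+) partitions missing")


def oldest_completion_seconds(recon_output):
    """Age in seconds of the oldest replication completion, or None."""
    match = OLDEST_RE.search(recon_output)
    return int(match.group(1)) if match else None


def partitions_missing_copies(dispersion_output):
    """Total number of partitions reported as missing one or more copies."""
    return sum(int(n) for n in MISSING_RE.findall(dispersion_output))


def ready_for_ring_change(recon_output, dispersion_output, busy_threshold_seconds=15 * 60):
    """True if replication looks idle and no partitions are missing copies."""
    oldest = oldest_completion_seconds(recon_output)
    if oldest is None or oldest > busy_threshold_seconds:
        return False
    return partitions_missing_copies(dispersion_output) == 0


recon = "Oldest completion was 2015-10-02 15:31:32 (120 seconds ago) by 192.168.245.4:6000."
print(ready_for_ring_change(recon, "There were 462 partitions missing one copy."))
```

This does not check for failed drives or servers, or whether min-part-hours has elapsed. Those checks still need to be made separately.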

9.5.5 Applying Input Model Changes to Existing Rings

This page describes a general approach for making changes to your existing swift rings. This approach applies to actions such as adding and removing a server and replacing and upgrading disk drives, and must be performed as a series of phases, as shown below:

9.5.5.1 Changing the Input Model Configuration Files

The first step to apply new changes to the swift environment is to update the configuration files. Follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Set the weight-step attribute, as needed, for the nodes you are altering. (For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”).

  3. Edit the configuration files as part of the Input Model as appropriate. (For general information about the Input Model, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.14 “Networks”. For more specific information about the swift parts of the configuration files, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”.)

  4. Once you have completed all of the changes, commit your configuration to the local git repository. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”.)

    ardana > git add -A
    ardana > git commit -m "commit message"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the swift playbook that will validate your configuration files and give you a report as an output:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml
  8. Use the report to validate that the number of drives proposed to be added or deleted, or the weight change, is correct. Fix any errors in your input model. At this stage, no changes have been made to rings.

9.5.5.2 First phase of Ring Rebalance

To begin the rebalancing of the swift rings, follow these steps:

  1. After going through the steps in the section above, deploy your changes to all of the swift nodes in your environment by running this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  2. Wait until replication has finished or min-part-hours has elapsed (whichever is longer). For more information, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

9.5.5.3 Weight Change Phase of Ring Rebalance

This phase requires no further changes to the input model. However, when you set the weight-step attribute, the rings that were rebuilt in the previous rebalance phase have weights that differ from their target (final) values. You gradually move to the target weight by rebalancing a number of times, as described in this section. For more information about the weight-step attribute, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

To begin rebalancing the rings, follow these steps:

  1. Rebalance the rings by running the playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
  2. Run the reconfiguration:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  3. Wait until replication has finished or min-part-hours has elapsed (whichever is longer). For more information, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  4. Run the following command and review the report:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*

    The following is an example of the output after executing the above command. In the example no weight changes are proposed:

    TASK: [swiftlm-ring-supervisor | validate-input-model | Print report] *********
    ok: [padawan-ccp-c1-m1-mgmt] => {
        "var": {
            "report.stdout_lines": [
                "Need to add 0 devices",
                "Need to remove 0 devices",
                "Need to set weight on 0 devices"
            ]
        }
    }
  5. When there are no proposed weight changes, proceed to the final phase.

  6. If there are proposed weight changes, repeat this phase.
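The decision in the last two steps can be automated by scanning the report for the "Need to set weight on" line. The sketch below assumes the summary format shown in the example output above. The function names are hypothetical, and a report without that line is deliberately treated as not complete.

```python
import re

NEED_RE = re.compile(r"Need to (add|remove|set weight on) (\d+) devices")


def pending_changes(report_lines):
    """Map each action named in the report to its device count."""
    counts = {}
    for line in report_lines:
        match = NEED_RE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def weight_phase_complete(report_lines):
    """True only if the report explicitly proposes zero weight changes."""
    return pending_changes(report_lines).get("set weight on") == 0


report = [
    "Need to add 0 devices",
    "Need to remove 0 devices",
    "Need to set weight on 0 devices",
]
print(weight_phase_complete(report))
```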

9.5.5.4 Final Rebalance Phase

The final rebalance phase moves all replicas to their final destination.

  1. Rebalance the rings by running the playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml | tee /tmp/rebalance.log
    Note

    The output is saved for later reference.

  2. Review the output from the previous step. If the output for all rings is similar to the following, the rebalance had no effect. That is, the rings are balanced and no further changes are needed. In addition, the ring files were not changed so you do not need to deploy them to the swift nodes:

    "Running: swift-ring-builder /etc/swiftlm/cloud1/cp1/builder_dir/account.builder rebalance 999",
          "NOTE: No partitions could be reassigned.",
          "Either none need to be or none can be due to min_part_hours [16]."

    The text No partitions could be reassigned indicates that no further rebalances are necessary. If this is true for all the rings, you have completed the final phase.

    Note

    You must have allowed enough time to elapse since the last rebalance. As mentioned in the above example, min_part_hours [16] means that you must wait at least 16 hours since the last rebalance. If not, you should wait until enough time has elapsed and repeat this phase.

  3. Run the swift-reconfigure.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  4. Wait until replication has finished or min-part-hours has elapsed (whichever is longer). For more information, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  5. Repeat the above steps until the ring is rebalanced.
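When reviewing the saved rebalance output, you need to confirm that the "No partitions could be reassigned" text appears for every ring. The following sketch scans the log for that text. It assumes the "Running: swift-ring-builder ... rebalance" line format shown above, and the function names are hypothetical. As noted above, the same message also appears when min_part_hours has not yet elapsed, so the result is meaningful only if you have waited long enough since the last rebalance.

```python
import re

RUNNING_RE = re.compile(r"Running: swift-ring-builder \S*/([\w-]+)\.builder rebalance")
NO_MOVE_TEXT = "No partitions could be reassigned"


def rings_without_changes(log_text):
    """Map each ring found in the log to True if nothing was reassigned."""
    results = {}
    ring = None
    for line in log_text.splitlines():
        match = RUNNING_RE.search(line)
        if match:
            # A new rebalance command starts the section for one ring.
            ring = match.group(1)
            results[ring] = False
        elif ring is not None and NO_MOVE_TEXT in line:
            results[ring] = True
    return results


def final_phase_complete(log_text):
    """True if at least one ring was seen and none had partitions reassigned."""
    results = rings_without_changes(log_text)
    return bool(results) and all(results.values())


log = (
    '"Running: swift-ring-builder /etc/swiftlm/cloud1/cp1/builder_dir/account.builder rebalance 999",\n'
    '"NOTE: No partitions could be reassigned.",\n'
)
print(final_phase_complete(log))
```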

9.5.5.5 System Changes that Change Existing Rings

There are many system changes ranging from adding servers to replacing drives, which might require you to rebuild and rebalance your rings.

Action | Process
Adding Server(s)
Removing Server(s)

In SUSE OpenStack Cloud, when you remove servers from the input model, the disk drives are removed from the ring - the weight is not gradually reduced using the weight-step attribute.

  • Remove servers in phases:

    • This reduces the impact of the changes on your system.

    • If your rings use swift zones, ensure you remove the same number of servers for each zone at each phase.

Adding Disk Drive(s)
Replacing Disk Drive(s)

When a drive fails, replace it as soon as possible. Do not attempt to remove it from the ring - this creates operator overhead. swift will continue to store the correct number of replicas by handing off objects to other drives instead of the failed drive.

If the disk drives are of the same size as the original when the drive is replaced, no ring changes are required. You can confirm this by running the swift-update-from-model-rebalance-rings.yml playbook. It should report that no weight changes are needed.

For a single drive replacement, even if the drive is significantly larger than the original drives, you do not need to rebalance the ring (however, the extra space on the drive will not be used).

Upgrading Disk Drives

If the drives are a different size (for example, you are upgrading your system), you can proceed as follows:

  • If not already done, set the weight-step attribute

  • Replace drives in phases:

    • Avoid replacing too many drives at once.

    • If your rings use swift zones, upgrade a number of drives in the same zone at the same time - not drives in several zones.

    • It is also safer to upgrade one server instead of drives in several servers at the same time.

    • Remember that the final size of all swift zones must be the same, so you may need to replace a small number of drives in one zone, then a small number in second zone, then return to the first zone and replace more drives, etc.

Removing Disk Drive(s)

When removing a disk drive from the input model, keep in mind that this drops the disk out of the ring without allowing Swift to move the data off it first. While it should be fine in a properly replicated healthy cluster, we do not recommend this approach. A better solution is to step down weight_step to 0 to allow Swift to move data.

9.5.6 Adding a New Swift Storage Policy

This page describes how to add an additional storage policy to an existing system. For an overview of storage policies, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.11 “Designing Storage Policies”.

To Add a Storage Policy

Perform the following steps to add the storage policy to an existing system.

  1. Log in to the Cloud Lifecycle Manager.

  2. Select a storage policy index and ring name.

    For example, if you already have object-0 and object-1 rings in your ring-specifications (usually in the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file), the next index is 2 and the ring name is object-2.

  3. Select a user-visible name. This is the name that you see when you examine container metadata and that you specify when you choose the storage policy for a new container. The name should be a single word (hyphen and dashes are allowed).

  4. Decide if this new policy will be the default for all new containers.

  5. Decide on other attributes such as partition-power and replica-count if you are using a standard replication ring. However, if you are using an erasure coded ring, you also need to decide on other attributes: ec-type, ec-num-data-fragments, ec-num-parity-fragments, and ec-object-segment-size. For more details on the required attributes, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.10 “Understanding Swift Ring Specifications”.

  6. Edit the ring-specifications attribute (usually in the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file) and add the new ring specification. If this policy is to be the default storage policy for new containers, set the default attribute to yes.

    Note
    1. Ensure that only one object ring has the default attribute set to yes. If you set two rings as default, swift processes will not start.

    2. Do not specify the weight-step attribute for the new object ring. Since this is a new ring there is no need to gradually increase device weights.

  7. Update the appropriate disk model to use the new storage policy (for example, the data/disks_swobj.yml file). The following sample shows that the object-2 has been added to the list of existing rings that use the drives:

    disk-models:
    - name: SWOBJ-DISKS
      ...
      device-groups:
      - name: swobj
        devices:
           ...
        consumer:
            name: swift
            attrs:
                rings:
                - object-0
                - object-1
                - object-2
      ...
    Note

    You must use the new object ring on at least one node that runs the swift-object service. If you skip this step and continue to run the swift-compare-model-rings.yml or swift-deploy.yml playbooks, they will fail with an error There are no devices in this ring, or all devices have been deleted, as shown below:

    TASK: [swiftlm-ring-supervisor | build-rings | Build ring (make-delta, rebalance)] ***
    failed: [padawan-ccp-c1-m1-mgmt] => {"changed": true, "cmd": ["swiftlm-ring-supervisor", "--make-delta", "--rebalance"], "delta": "0:00:03.511929", "end": "2015-10-07 14:02:03.610226", "rc": 2, "start": "2015-10-07 14:02:00.098297", "warnings": []}
    ...
    Running: swift-ring-builder /etc/swiftlm/cloud1/cp1/builder_dir/object-2.builder rebalance 999
    ERROR: -------------------------------------------------------------------------------
    An error has occurred during ring validation. Common
    causes of failure are rings that are empty or do not
    have enough devices to accommodate the replica count.
    Original exception message:
    There are no devices in this ring, or all devices have been deleted
    -------------------------------------------------------------------------------
  8. Commit your configuration:

    ardana > git add -A
    ardana > git commit -m "commit message"
  9. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  10. Create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  11. Validate the changes by running the swift-compare-model-rings.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml

    If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”. Then, re-run steps 5 - 10.

  12. Create the new ring (for example, object-2). Then verify the swift service status and reconfigure the swift node to use a new storage policy, by running these playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml

After adding a storage policy, there is no need to rebalance the ring.

9.5.7 Changing min-part-hours in Swift

The min-part-hours parameter specifies the number of hours you must wait before swift will allow a given partition to be moved. In other words, it constrains how often you perform ring rebalance operations. Before changing this value, you should get some experience with how long it takes your system to perform replication after you make ring changes (for example, when you add servers).

See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for more information about determining when replication has completed.
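
As a rough illustration of the constraint, the following Python sketch checks whether enough time has passed since the last rebalance for partitions to be moved again. The function name and its inputs are hypothetical. swift enforces this rule internally (and tracks it per partition), so this is only a simplified model for planning purposes.

```python
from datetime import datetime, timedelta


def can_move_partitions(last_rebalance, now, min_part_hours):
    """Return True if at least min_part_hours have elapsed since last_rebalance.

    Simplified model: treats the whole ring as having one "last moved" time.
    """
    if min_part_hours <= 0:
        # The guide states that a value of zero is not allowed.
        raise ValueError("min-part-hours must be greater than zero")
    return now - last_rebalance >= timedelta(hours=min_part_hours)


# Hypothetical timestamps: the last rebalance ran at 08:00.
last = datetime(2022, 11, 2, 8, 0)
print(can_move_partitions(last, datetime(2022, 11, 2, 20, 0), 16))  # 12 hours elapsed
print(can_move_partitions(last, datetime(2022, 11, 3, 1, 0), 16))   # 17 hours elapsed
```

A check of this kind can help you decide whether a planned ring change should wait, but the measured replication time on your own system remains the better guide for choosing the value.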

9.5.7.1 Changing the min-part-hours Value

To change the min-part-hours value, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit your ~/openstack/my_cloud/definition/data/swift/swift_config.yml file and change the value(s) of min-part-hours for the rings you desire. The value is expressed in hours and a value of zero is not allowed.

  3. Commit your configuration to the local Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Apply the changes by running this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml

9.5.8 Changing Swift Zone Layout

Before changing the number of swift zones or the assignment of servers to specific zones, you must ensure that your system has sufficient storage available to perform the operation. Specifically, if you are adding a new zone, you may need additional storage. There are two reasons for this:

  • You cannot simply change the swift zone number of disk drives in the ring. Instead, you need to remove the server(s) from the ring and then re-add the server(s) with a new swift zone number to the ring. At the point where the servers are removed from the ring, there must be sufficient spare capacity on the remaining servers to hold the data that was originally hosted on the removed servers.

  • The total amount of storage in each swift zone must be the same. This is because new data is added to each zone at the same rate. If one zone has a lower capacity than the other zones, once that zone becomes full, you cannot add more data to the system – even if there is unused space in the other zones.
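
The effect of unequal zone sizes can be sketched with a small calculation. The following Python example assumes, as described above, that new data is added to each zone at the same rate, so the smallest zone limits the whole system. The function name and the capacities are hypothetical.

```python
def usable_raw_capacity(zone_capacities_tb):
    """Estimate the raw capacity (TB) that can be filled before the smallest zone is full.

    Assumes every zone receives data at the same rate.
    """
    if not zone_capacities_tb:
        raise ValueError("at least one zone is required")
    smallest = min(zone_capacities_tb.values())
    return smallest * len(zone_capacities_tb)


balanced = {1: 100, 2: 100, 3: 100}
unbalanced = {1: 100, 2: 100, 3: 60}

print(usable_raw_capacity(balanced))    # 300
print(usable_raw_capacity(unbalanced))  # 180
```

In the unbalanced case, 80 TB of the installed 260 TB remains unused once zone 3 is full.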

As mentioned above, you cannot simply change the swift zone number of disk drives in an existing ring. Instead, you must remove and then re-add servers. This is a summary of the process:

  1. Identify appropriate server groups that correspond to the desired swift zone layout.

  2. Remove the servers in a server group from the rings. This process may be protracted, either by removing servers in small batches or by using the weight-step attribute so that you limit the amount of replication traffic that happens at once.

  3. Once all the targeted servers are removed, edit the swift-zones attribute in the ring specifications to add or remove a swift zone.

  4. Re-add the servers you had temporarily removed to the rings. Again you may need to do this in batches or rely on the weight-step attribute.

  5. Continue removing and re-adding servers until you reach your final configuration.

9.5.8.1 Process for Changing Swift Zones

This section describes the detailed process of reorganizing swift zones. As a concrete example, we assume we start with a single swift zone and the target is three swift zones. The same general process applies if you are reducing the number of zones.

The process is as follows:

  1. Identify the appropriate server groups that represent the desired final state. In this example, we are going to change the swift zone layout as follows:

    Original Layout:

    swift-zones:
      - id: 1
        server-groups:
          - AZ1
          - AZ2
          - AZ3

    Target Layout:

    swift-zones:
      - id: 1
        server-groups:
          - AZ1
      - id: 2
        server-groups:
          - AZ2
      - id: 3
        server-groups:
          - AZ3

    The plan is to move servers from server groups AZ2 and AZ3 to a new swift zone number. The servers in AZ1 will remain in swift zone 1.

  2. If you have not already done so, consider setting the weight-step attribute as described in Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  3. Identify the servers in the AZ2 server group. You may remove all servers at once or remove them in batches. If this is the first time you have performed a major ring change, we suggest you remove one or two servers only in the first batch. When you see how long this takes and the impact replication has on your system you can then use that experience to decide whether you can remove a larger batch of servers, or increase or decrease the weight-step attribute for the next server-removal cycle. To remove a server, use steps 2-9 as described in Section 15.1.5.1.4, “Removing a Swift Node” ensuring that you do not remove the servers from the input model.

  4. This process may take a number of ring rebalance cycles until the disk drives are removed from the ring files. Once this happens, you can edit the ring specifications and add swift zone 2 as shown in this example:

    swift-zones:
      - id: 1
        server-groups:
          - AZ1
          - AZ3
      - id: 2
        server-groups:
          - AZ2
  5. The server removal process in step #3 set the "remove" attribute in the pass-through attribute of the servers in server group AZ2. Edit the input model files and remove this pass-through attribute. This signals to the system that the servers should be used the next time we rebalance the rings (that is, the server should be added to the rings).

  6. Commit your configuration to the local Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  7. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  8. Use the playbook to create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  9. Rebuild and deploy the swift rings containing the re-added servers by running this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  10. Wait until replication has finished. For more details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  11. You may need to continue to rebalance the rings. For instructions, see the "Final Rebalance Stage" steps at Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

  12. At this stage, the servers in server group AZ2 are responsible for swift zone 2. Repeat the process in steps #3-9 to remove the servers in server group AZ3 from the rings and then re-add them to swift zone 3. The ring specifications for zones (step 4) should be as follows:

    swift-zones:
      - id: 1
        server-groups:
          - AZ1
      - id: 2
        server-groups:
          - AZ2
      - id: 3
        server-groups:
          - AZ3
  13. Once complete, all data should be dispersed (that is, each replica is located) in the swift zones as specified in the input model.
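
The staged layout changes in this procedure can be sanity-checked before you commit them. The following Python sketch represents each swift-zones specification as a list of dictionaries and reports which server groups change zone between two stages. The helper names are hypothetical, and the rule that a server group belongs to exactly one zone is an assumption of this sketch, not a statement about the configuration processor.

```python
def zone_assignments(swift_zones):
    """Map each server group to its swift zone id, rejecting duplicates."""
    assignments = {}
    seen_ids = set()
    for zone in swift_zones:
        zone_id = zone["id"]
        if zone_id in seen_ids:
            raise ValueError(f"duplicate swift zone id: {zone_id}")
        seen_ids.add(zone_id)
        for group in zone["server-groups"]:
            if group in assignments:
                raise ValueError(f"server group {group} is in more than one zone")
            assignments[group] = zone_id
    return assignments


def moved_groups(before, after):
    """Return the server groups whose zone differs between two layouts."""
    old, new = zone_assignments(before), zone_assignments(after)
    return sorted(group for group in new if old.get(group) != new[group])


original = [{"id": 1, "server-groups": ["AZ1", "AZ2", "AZ3"]}]
intermediate = [
    {"id": 1, "server-groups": ["AZ1", "AZ3"]},
    {"id": 2, "server-groups": ["AZ2"]},
]
target = [
    {"id": 1, "server-groups": ["AZ1"]},
    {"id": 2, "server-groups": ["AZ2"]},
    {"id": 3, "server-groups": ["AZ3"]},
]

print(moved_groups(original, intermediate))  # ['AZ2']
print(moved_groups(intermediate, target))    # ['AZ3']
```

Each stage moves exactly one server group, which matches the remove and re-add cycle described in the steps above.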

9.6 Configuring your swift System to Allow Container Sync

swift has a feature where all the contents of a container can be mirrored to another container through background synchronization. swift operators configure their system to allow/accept sync requests to/from other systems, and the user specifies where to sync their container to along with a secret synchronization key. For an overview of this feature, refer to OpenStack swift - Container to Container Synchronization.

9.6.1 Notes and limitations

Container synchronization runs as a background action. When you put an object into the source container, it takes some time before the object becomes visible in the destination container. Storage services do not necessarily copy objects in any particular order, so objects may be transferred in a different order from the one in which they were created.

Container sync may not be able to keep up with a moderate upload rate to a container. For example, if the average object upload rate to a container is greater than one object per second, then container sync may not be able to keep the objects synced.

If container sync is enabled on a container that already has a large number of objects then container sync may take a long time to sync the data. For example, a container with one million 1KB objects could take more than 11 days to complete a sync.
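
The 11-day figure is consistent with a sync rate of roughly one object per second. The following Python sketch shows the arithmetic. The function name is hypothetical, and the rate of one object per second is an assumption borrowed from the upload-rate example above, not a measured value for your system.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def estimated_sync_days(object_count, objects_per_second=1.0):
    """Estimate how many days an initial sync takes at a constant rate."""
    if objects_per_second <= 0:
        raise ValueError("objects_per_second must be positive")
    return object_count / objects_per_second / SECONDS_PER_DAY


# One million objects at an assumed rate of one object per second.
print(round(estimated_sync_days(1_000_000), 1))  # 11.6
```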

You may operate on the destination container just like any other container -- adding or deleting objects -- including the objects that are in the destination container because they were copied from the source container. To decide how to handle object creation, replacement or deletion, the system uses timestamps. In general, the latest timestamp "wins". That is, if you create an object, replace it, delete it and then re-create it, the destination container will eventually contain the most recently created object. However, if you also create and delete objects in the destination container, you get some subtle behaviors, as follows:

  • If an object is copied to the destination container and then deleted, it remains deleted in the destination even though there is still a copy in the source container. If you modify the object (replace or change its metadata) in the source container, it will reappear in the destination again.

  • The same applies to a replacement or metadata modification of an object in the destination container -- the object will remain as-is unless there is a replacement or modification in the source container.

  • If you replace or modify metadata of an object in the destination container and then delete it in the source container, it is not deleted from the destination. This is because your modified object has a later timestamp than the object you deleted in the source.

  • If you create an object in the source container and before the system has a chance to copy it to the destination, you also create an object of the same name in the destination, then the object in the destination is not overwritten by the source container's object.

Segmented objects

Segmented objects (objects larger than 5GB) will not work seamlessly with container synchronization. If the manifest object is copied to the destination container before the object segments, when you perform a GET operation on the manifest object, the system may fail to find some or all of the object segments. If your manifest and object segments are in different containers, both containers must be synchronized, and the container name of the object segments must be the same on both source and destination.

9.6.2 Prerequisites

Container to container synchronization requires that SSL certificates are configured on both the source and destination systems. For more information on how to implement SSL, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 41 “Configuring Transport Layer Security (TLS)”.

9.6.3 Configuring container sync

Container to container synchronization requires that both the source and destination swift systems involved be configured to allow/accept this. In the context of container to container synchronization, swift uses the term cluster to denote a swift system. swift clusters correspond to Control Planes in OpenStack terminology.

Gather the public API endpoints for both swift systems

Gather information about the external/public URL used by each system, as follows:

  1. On the Cloud Lifecycle Manager of one system, get the public API endpoint of the system by running the following commands:

    ardana > source ~/service.osrc
    ardana > openstack endpoint list | grep swift

    The output of the command will look similar to this:

    ardana > openstack endpoint list | grep swift
    | 063a84b205c44887bc606c3ba84fa608 | region0 | swift           | object-store    | True    | admin     | https://10.13.111.176:8080/v1/AUTH_%(tenant_id)s |
    | 3c46a9b2a5f94163bb5703a1a0d4d37b | region0 | swift           | object-store    | True    | public    | https://10.13.120.105:8080/v1/AUTH_%(tenant_id)s |
    | a7b2f4ab5ad14330a7748c950962b188 | region0 | swift           | object-store    | True    | internal  | https://10.13.111.176:8080/v1/AUTH_%(tenant_id)s |

    The portion that you want is the public endpoint up to, but not including, the AUTH part. In the above example, this is https://10.13.120.105:8080/v1.

  2. Repeat these steps on the other swift system so you have both of the public API endpoints for them.
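
If you collect endpoints from several systems, trimming the URL can be scripted. The following Python sketch cuts a public endpoint at the AUTH part, as described in step 1. The function name is hypothetical.

```python
def sync_endpoint(public_url):
    """Return the endpoint up to, but not including, the /AUTH_ part."""
    marker = "/AUTH_"
    index = public_url.find(marker)
    if index == -1:
        raise ValueError(f"no {marker} part found in {public_url!r}")
    return public_url[:index]


# Public endpoint taken from the sample output above.
print(sync_endpoint("https://10.13.120.105:8080/v1/AUTH_%(tenant_id)s"))
# https://10.13.120.105:8080/v1
```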

Validate connectivity between both systems

The swift nodes running the swift-container service must be able to connect to the public API endpoints of each other for the container sync to work. You can validate connectivity on each system using these steps.

For the sake of the examples, we use the terms source and destination to denote the systems doing the synchronization.

  1. Log in to a swift node running the swift-container service on the source system. You can determine this by looking at the service list in your ~/openstack/my_cloud/info/service_info.yml file for a list of the servers containing this service.

  2. Verify the SSL certificates by running this command against the destination swift server:

    echo | openssl s_client -connect PUBLIC_API_ENDPOINT:8080 -CAfile /etc/ssl/certs/ca-certificates.crt

    If the connection was successful you should see a return code of 0 (ok) similar to this:

    ...
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
  3. Also verify that the source node can connect to the destination swift system using this command:

    ardana > curl -k https://DESTINATION_IP_OR_HOSTNAME:8080/healthcheck

    If the connection was successful, you should see a response of OK.

  4. Repeat these verification steps on any system involved in your container synchronization setup.

Configure container to container synchronization

Both the source and destination swift systems must be configured the same way, using sync realms. For more details on how sync realms work, see OpenStack swift - Configuring Container Sync.

To configure one of the systems, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/swift/container-sync-realms.conf.j2 file and uncomment the sync realm section.

    Here is a sample showing this section in the file:

    #Add sync realms here, for example:
    # [realm1]
    # key = realm1key
    # key2 = realm1key2
    # cluster_name1 = https://host1/v1/
    # cluster_name2 = https://host2/v1/
  3. Add in the details for your source and destination systems. Each realm you define is a set of clusters that have agreed to allow container syncing between them. These values are case sensitive.

    Only one key is required. The second key is optional and can be provided to allow an operator to rotate keys if desired. The values for the clusters must contain the prefix cluster_ and will be populated with the public API endpoints for the systems.

  4. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -m "Configure container sync realms"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update the deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the swift reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  8. Run this command to validate that your container synchronization is configured:

    ardana > source ~/service.osrc
    ardana > swift capabilities

    Here is a snippet of the output showing the container sync information. This should be populated with your cluster names:

    ...
    Additional middleware: container_sync
     Options:
      realms: {u'INTRACLUSTER': {u'clusters': {u'THISCLUSTER': {}}}}
  9. Repeat these steps on any other swift systems that will be involved in your sync realms.
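
The realm section described in steps 2 and 3 can be sketched programmatically. The following Python example renders a realm section and enforces the cluster_ prefix on cluster names. The function name is hypothetical, and the realm name, keys, and hosts are the placeholder values from the sample file, not real systems.

```python
import configparser
import io


def render_realm(realm, key, clusters, key2=None):
    """Render one sync realm section in INI format.

    clusters maps option names (which must start with "cluster_")
    to public API endpoints.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case sensitive
    parser[realm] = {"key": key}
    if key2:
        parser[realm]["key2"] = key2  # optional second key for key rotation
    for name, endpoint in clusters.items():
        if not name.startswith("cluster_"):
            raise ValueError(f"cluster option {name!r} must start with 'cluster_'")
        parser[realm][name] = endpoint
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(render_realm(
    "realm1",
    "realm1key",
    {
        "cluster_name1": "https://host1/v1/",
        "cluster_name2": "https://host2/v1/",
    },
))
```

The same realm definition, with the same key and cluster names, must be present on both the source and the destination system.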

9.6.4 Configuring Intra Cluster Container Sync

It is possible to use the swift container sync functionality to sync objects between containers within the same swift system. swift is automatically configured to allow intra cluster container sync. Each swift PAC server will have an intracluster container sync realm defined in /etc/swift/container-sync-realms.conf.

For example:

# The intracluster realm facilitates syncing containers on this system
[intracluster]
key = lQ8JjuZfO
# key2 =
cluster_thiscluster = http://SWIFT-PROXY-VIP:8080/v1/

The keys defined in /etc/swift/container-sync-realms.conf are used by the container-sync daemon to determine trust. In addition, the containers that will be in sync need a separate shared key, which both define in their container metadata, to establish trust between each other.

  1. Create two containers, for example container-src and container-dst. In this example we will sync one way from container-src to container-dst.

    ardana > openstack container create container-src
    ardana > openstack container create container-dst
  2. Determine your swift account. In the following example it is AUTH_1234:

    ardana > swift stat
                                     Account: AUTH_1234
                                  Containers: 3
                                     Objects: 42
                                       Bytes: 21692421
    Containers in policy "erasure-code-ring": 3
       Objects in policy "erasure-code-ring": 42
         Bytes in policy "erasure-code-ring": 21692421
                                Content-Type: text/plain; charset=utf-8
                 X-Account-Project-Domain-Id: default
                                 X-Timestamp: 1472651418.17025
                                  X-Trans-Id: tx81122c56032548aeae8cd-0057cee40c
                               Accept-Ranges: bytes
  3. Configure container-src to sync to container-dst using a key specified by both containers. Replace KEY with your key.

    ardana > swift post -t '//intracluster/thiscluster/AUTH_1234/container-dst' -k 'KEY' container-src
  4. Configure container-dst to accept synced objects with this key:

    ardana > swift post -k 'KEY' container-dst
  5. Upload objects to container-src. Within a number of minutes the objects should be automatically synced to container-dst.
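
The sync target used in step 3 has the form //REALM/CLUSTER/ACCOUNT/CONTAINER. The following Python sketch assembles that string from its parts. The function name is hypothetical, and the values are the ones from the example above.

```python
def sync_target(realm, cluster, account, container):
    """Build a container sync target of the form //realm/cluster/account/container."""
    for part in (realm, cluster, account, container):
        if not part or "/" in part:
            raise ValueError(f"invalid sync target component: {part!r}")
    return f"//{realm}/{cluster}/{account}/{container}"


print(sync_target("intracluster", "thiscluster", "AUTH_1234", "container-dst"))
# //intracluster/thiscluster/AUTH_1234/container-dst
```

Note that the cluster component is the cluster name without the cluster_ prefix used in the realms file.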

Changing the intracluster realm key

The intracluster realm key used by container sync to sync objects between containers in the same swift system is automatically generated. The process for changing passwords is described in Section 5.7, “Changing Service Passwords”.

The steps to change the intracluster realm key are as follows.

  1. On the Cloud Lifecycle Manager create a file called ~/openstack/change_credentials/swift_data_metadata.yml with the contents included below. The consuming-cp and cp are the control plane name specified in ~/openstack/my_cloud/definition/data/control_plane.yml where the swift-container service is running.

    swift_intracluster_sync_key:
     metadata:
     - clusters:
       - swpac
       component: swift-container
       consuming-cp: control-plane-1
       cp: control-plane-1
     version: '2.0'
  2. Run the following commands

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Reconfigure the swift credentials

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure-credentials-change.yml
  4. Delete ~/openstack/change_credentials/swift_data_metadata.yml

    ardana > rm ~/openstack/change_credentials/swift_data_metadata.yml
  5. On a swift PAC server check that the intracluster realm key has been updated in /etc/swift/container-sync-realms.conf

    # The intracluster realm facilitates syncing containers on this system
    [intracluster]
    key = aNlDn3kWK
  6. Update any containers using the intracluster container sync to use the new intracluster realm key

    ardana > swift post -k 'aNlDn3kWK' container-src
    ardana > swift post -k 'aNlDn3kWK' container-dst

10 Managing Networking

Information about managing and configuring the Networking service.

10.1 SUSE OpenStack Cloud Firewall

Firewall as a Service (FWaaS) provides the ability to assign network-level, port security for all traffic entering an existing tenant network. More information on this service can be found in the public OpenStack documentation located at http://specs.openstack.org/openstack/neutron-specs/specs/api/firewall_as_a_service__fwaas_.html. The following documentation provides command-line interface example instructions for configuring and testing a SUSE OpenStack Cloud firewall. FWaaS can also be configured and managed by the horizon web interface.

With SUSE OpenStack Cloud, FWaaS is implemented directly in the L3 agent (neutron-l3-agent). However if VPNaaS is enabled, FWaaS is implemented in the VPNaaS agent (neutron-vpn-agent). Because FWaaS does not use a separate agent process or start a specific service, there currently are no monasca alarms for it.

If DVR is enabled, the firewall service currently does not filter traffic between OpenStack private networks (east-west traffic). It only filters traffic from external networks (north-south traffic).

Note

The L3 agent must be restarted on each compute node hosting a DVR router when removing the FWaaS or adding a new FWaaS. This condition only applies when updating existing instances connected to DVR routers. For more information, see the upstream bug.

10.1.1 Overview of the SUSE OpenStack Cloud Firewall configuration

The following instructions provide information about how to identify and modify the overall SUSE OpenStack Cloud firewall that is configured in front of the control services. This firewall is administered only by a cloud admin and is not available for tenant use for private network firewall services.

During the installation process, the configuration processor will automatically generate "allow" firewall rules for each server based on the services deployed and block all other ports. These are populated in ~/openstack/my_cloud/info/firewall_info.yml, which includes a list of all the ports by network, including the addresses on which the ports will be opened. This is described in more detail in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 5 “Input Model”, Section 5.2 “Concepts”, Section 5.2.10 “Networking”, Section 5.2.10.5 “Firewall Configuration”.

The firewall_rules.yml file in the input model allows you to define additional rules for each network group. You can read more about this in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.15 “Firewall Rules”.

The purpose of this document is to show you how to make post-installation changes to the firewall rules if the need arises.

Important

This process is not to be confused with Firewall-as-a-Service, which is a separate service that enables the ability for SUSE OpenStack Cloud tenants to create north-south, network-level firewalls to provide stateful protection to all instances in a private, tenant network. This service is optional and is tenant-configured.

10.1.2 SUSE OpenStack Cloud 9 FWaaS Configuration

Check for an enabled firewall.

  1. Check whether the firewall is enabled. The output of the openstack extension list command should contain a firewall entry.

    openstack extension list
  2. Assuming the external network is already created by the admin, this command will show the external network.

    openstack network list

Create required assets.

Before creating firewalls, you will need to create a network, subnet, router, security group rules, start an instance and assign it a floating IP address.

  1. Create the network, subnet and router.

    openstack network create private
    openstack subnet create --network private --subnet-range 10.0.0.0/24 --gateway 10.0.0.1 sub
    openstack router create router
    openstack router add subnet router sub
    openstack router set --external-gateway ext-net router
  2. Create security group rules. Security group rules filter traffic at VM level.

    openstack security group rule create --protocol icmp default
    openstack security group rule create --protocol tcp --dst-port 22 default
    openstack security group rule create --protocol tcp --dst-port 80 default
  3. Boot a VM.

    NET=$(openstack network list | awk '/private/ {print $2}')
    openstack server create --flavor 1 --image <image> --nic net-id=$NET --wait vm1
  4. Verify if the instance is ACTIVE and is assigned an IP address.

    openstack server list
  5. Get the port id of the vm1 instance.

    fixedip=$(openstack server list | awk '/vm1/ {print $12}' | awk -F '=' '{print $2}' | awk -F ',' '{print $1}')
    vmportuuid=$(openstack port list | grep $fixedip | awk '{print $2}')
  6. Create and associate a floating IP address to the vm1 instance.

    openstack floating ip create --port $vmportuuid ext-net
  7. Verify if the floating IP is assigned to the instance. The following command should show an assigned floating IP address from the external network range.

    openstack server show vm1
  8. Verify that the instance is reachable from the external network. SSH into the instance from a node that is in, or has a route to, the external network.

    ssh cirros@FIP-VM1
    password: <password>

Create and attach the firewall.

Note

By default, an internal "drop all" rule is enabled in iptables. It applies to packets that do not match any of the defined rules.

  1. Create new firewall rules using the firewall-rule-create command, providing the protocol, action (allow, deny, reject) and name for the new rule.

    Firewall actions determine how matching traffic is handled. An allow rule lets traffic pass through the firewall. A deny rule silently drops the traffic. A reject rule drops the traffic and returns a destination-unreachable response. Using reject speeds up failure detection considerably for legitimate users, because they do not have to wait for retransmission timeouts or submit retries. Use deny where you want to hinder port scanners and similar probing by hostile attackers: because deny drops all packets without responding, it gives an attacker less information. The default action is deny.

    The example below demonstrates how to allow icmp and ssh while denying access to http. See the OpenStackClient command-line reference at https://docs.openstack.org/python-openstackclient/rocky/ on additional options such as source IP, destination IP, source port and destination port.

    Note

    You can create several firewall rules with an identical name, because each rule receives a unique ID. However, for clarity this is not recommended.

    neutron firewall-rule-create --protocol icmp --action allow --name allow-icmp
    neutron firewall-rule-create --protocol tcp --destination-port 80 --action deny --name deny-http
    neutron firewall-rule-create --protocol tcp --destination-port 22 --action allow --name allow-ssh
  2. Once the rules are created, create the firewall policy by using the firewall-policy-create command with the --firewall-rules option and rules to include in quotes, followed by the name of the new policy. The order of the rules is important.

    neutron firewall-policy-create --firewall-rules "allow-icmp deny-http allow-ssh" policy-fw
  3. Finish the firewall creation by using the firewall-create command, the policy name and the new name you want to give to your new firewall.

    neutron firewall-create policy-fw --name user-fw
  4. You can view the details of your new firewall by using the firewall-show command and the name of your firewall. This will verify that the status of the firewall is ACTIVE.

    neutron firewall-show user-fw
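
To see why the order of rules in a policy matters, consider the following Python sketch. It models a policy as an ordered list and assumes first-match evaluation: the first rule that matches a packet decides the action, and unmatched traffic falls through to the default "drop all" behavior. The class, the function, and the simplified matching logic are hypothetical illustrations, not the FWaaS implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    name: str
    protocol: str
    action: str
    destination_port: Optional[int] = None  # None matches any port


def evaluate(policy, protocol, port=None):
    """Return the action of the first matching rule, or "deny" if none matches."""
    for rule in policy:
        if rule.protocol != protocol:
            continue
        if rule.destination_port is not None and rule.destination_port != port:
            continue
        return rule.action
    return "deny"  # models the internal "drop all" rule


policy_fw = [
    Rule("allow-icmp", "icmp", "allow"),
    Rule("deny-http", "tcp", "deny", 80),
    Rule("allow-ssh", "tcp", "allow", 22),
]

print(evaluate(policy_fw, "icmp"))      # allow
print(evaluate(policy_fw, "tcp", 22))   # allow
print(evaluate(policy_fw, "tcp", 80))   # deny
print(evaluate(policy_fw, "tcp", 443))  # deny (no rule matches)

# A broad rule placed first shadows the more specific deny-http rule.
shadowed = [Rule("allow-all-tcp", "tcp", "allow")] + policy_fw
print(evaluate(shadowed, "tcp", 80))    # allow
```

Under this assumption, specific deny rules must come before broader allow rules for the same protocol, otherwise they never take effect.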

Verify the FWaaS is functional.

  1. Because the allow-icmp firewall rule is set, you can ping the floating IP address of the instance from the external network.

    ping <FIP-VM1>
  2. Similarly, you can connect via ssh to the instance due to the allow-ssh firewall rule.

    ssh cirros@<FIP-VM1>
    password: <password>
  3. Run a web server on the vm1 instance that listens on port 80, accepts requests and sends a WELCOME response.

    $ vi webserv.sh
    
    #!/bin/bash
    
    MYIP=$(/sbin/ifconfig eth0|grep 'inet addr'|awk -F: '{print $2}'| awk '{print $1}');
    while true; do
      echo -e "HTTP/1.0 200 OK
    
    Welcome to $MYIP" | sudo nc -l -p 80
    done
    
    # Give it Exec rights
    $ chmod 755 webserv.sh
    
    # Execute the script
    $ ./webserv.sh
  4. You should expect to see curl fail over port 80 because of the deny-http firewall rule. If curl succeeds, the firewall is not blocking incoming http requests.

    curl -vvv <FIP-VM1>
Warning

When using the reference implementation, networks, floating IPs and routers created after the firewall are not automatically updated with the firewall rules. To apply the rules, run the firewall-update command and pass both the existing and the new router IDs, so that the rules are reconfigured across all routers.

For example, if router-1 was created before the firewall and router-2 was created after it:

$ neutron firewall-update --router <router-1-id> --router <router-2-id> <firewall-name>

10.1.3 Making Changes to the Firewall Rules

  1. Log in to your Cloud Lifecycle Manager.

  2. Edit your ~/openstack/my_cloud/definition/data/firewall_rules.yml file and add the lines necessary to allow the port(s) needed through the firewall.

    In this example we are going to open up port range 5900-5905 to allow VNC traffic through the firewall:

      - name: VNC
        network-groups:
        - MANAGEMENT
        rules:
         - type: allow
           remote-ip-prefix:  0.0.0.0/0
           port-range-min: 5900
           port-range-max: 5905
           protocol: tcp
    Note

    The example above shows a remote-ip-prefix of 0.0.0.0/0, which opens the ports to all IP ranges. To be more secure, specify the CIDR of the local network from which you will run the VNC connection.

  3. Commit those changes to your local git:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "firewall rule update"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Create the deployment directory structure:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Change to the deployment directory and run the osconfig-iptables-deploy.yml playbook to update your iptables rules to allow VNC:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-iptables-deploy.yml

You can repeat these steps as needed to add, remove, or edit any of these firewall rules.
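
Before committing a new rule, it can help to sanity-check its values. The following is a minimal Python sketch and not part of SUSE OpenStack Cloud. The check_rule helper is hypothetical and only mirrors the keys shown in the VNC example above. It flags an invalid port range and reports when remote-ip-prefix is open to all addresses.

```python
import ipaddress


def check_rule(rule):
    """Return a list of problems found in one firewall rule mapping."""
    problems = []
    low = rule.get("port-range-min")
    high = rule.get("port-range-max")
    if not (isinstance(low, int) and isinstance(high, int)):
        problems.append("port range must be two integers")
    elif not 1 <= low <= high <= 65535:
        problems.append("port range must satisfy 1 <= min <= max <= 65535")
    try:
        # strict=False tolerates host bits, for example 192.168.1.5/24
        prefix = ipaddress.ip_network(str(rule.get("remote-ip-prefix")), strict=False)
    except ValueError:
        problems.append("remote-ip-prefix is not a valid CIDR")
    else:
        if prefix.prefixlen == 0:
            problems.append("remote-ip-prefix is open to all addresses")
    return problems


# The VNC rule from the example above, written as a Python mapping.
vnc_rule = {
    "type": "allow",
    "remote-ip-prefix": "0.0.0.0/0",
    "port-range-min": 5900,
    "port-range-max": 5905,
    "protocol": "tcp",
}
print(check_rule(vnc_rule))
```

An empty list means the sketch found nothing to complain about. The configuration processor remains the authoritative validator of firewall_rules.yml.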

10.1.4 More Information

Firewalls are based on iptables settings.

Each firewall that is created is known as an instance.

A firewall instance can be deployed on selected project routers. If no specific project router is selected, a firewall instance is automatically applied to all project routers.

Only one firewall instance can be applied to a project router.

Only one firewall policy can be applied to a firewall instance.

Multiple firewall rules can be added and applied to a firewall policy.

Firewall rules can be shared across different projects via the Share API flag.

Firewall rules supersede the Security Group rules that are applied at the Instance level for all traffic entering or leaving a private, project network.

For more information on the command-line interface (CLI) and firewalls, see the OpenStack networking command-line client reference: https://docs.openstack.org/python-openstackclient/rocky/

10.2 Using VPN as a Service (VPNaaS)

SUSE OpenStack Cloud 9 VPNaaS Configuration

This document describes the configuration process and requirements for the SUSE OpenStack Cloud 9 Virtual Private Network (VPN) as a Service (VPNaaS) module.

10.2.1 Prerequisites

  1. SUSE OpenStack Cloud must be installed.

  2. Before setting up VPNaaS, you will need to have created an external network and a subnet with access to the internet. Information on how to create the external network and subnet can be found in Section 10.2.4, “More Information”.

  3. This document assumes 172.16.0.0/16 as the ext-net CIDR.

10.2.2 Considerations

Using the neutron plugin-based VPNaaS causes additional processes to be run on the Network Service Nodes. One of these processes, the ipsec charon process from StrongSwan, runs as root and listens on an external network. A vulnerability in that process can lead to remote root compromise of the Network Service Nodes. If this is a concern, consider using a VPN solution other than the neutron plugin-based VPNaaS, deploying additional protection mechanisms, or both.

10.2.3 Configuration

Setup Networks

You can set up VPN as a Service (VPNaaS) by first creating networks, subnets, and routers using the neutron command line. The VPNaaS module lets you extend access between private networks across two different SUSE OpenStack Cloud clouds or between a SUSE OpenStack Cloud cloud and a non-cloud network. VPNaaS is based on the open source software application StrongSwan. StrongSwan (more information available at http://www.strongswan.org/) is an IPsec implementation and provides basic VPN gateway functionality.

Note

You can execute the included commands from any shell with access to the service APIs. In the included examples, the commands are executed from the lifecycle manager; however, you could execute them from the controller node or any other shell with the aforementioned service API access.

Note

Floating IPs cannot be used with the current version of VPNaaS when DVR is enabled. Ensure that no floating IP is associated with instances that will be using VPNaaS through a DVR router. Floating IPs associated with instances are acceptable when using a CVR router.

  1. From the Cloud Lifecycle Manager, create the first private network, subnet, and router, assuming that ext-net has already been created by the admin.

    openstack network create privateA
    openstack subnet create subA --network privateA --subnet-range 10.1.0.0/24 --gateway 10.1.0.1
    openstack router create router1
    openstack router add subnet router1 subA
    openstack router set router1 --external-gateway ext-net
  2. Create the second private network, subnet, and router.

    openstack network create privateB
    openstack subnet create subB --network privateB --subnet-range 10.2.0.0/24 --gateway 10.2.0.1
    openstack router create router2
    openstack router add subnet router2 subB
    openstack router set router2 --external-gateway ext-net
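
A site-to-site VPN routes traffic by destination CIDR, so the two private subnets should not overlap with each other or with ext-net. The following minimal Python sketch checks an address plan for overlaps. The overlapping_pairs helper and the plan mapping are illustrative names, and the CIDRs are the ones assumed in this section.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs):
    """Return the pairs of named CIDRs that overlap each other."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Address plan used in this section.
plan = {
    "ext-net": "172.16.0.0/16",
    "subA": "10.1.0.0/24",
    "subB": "10.2.0.0/24",
}
print(overlapping_pairs(plan))
```

An empty list means that no two networks in the plan overlap.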
Procedure 10.1: Starting Virtual Machines
  1. From the Cloud Lifecycle Manager, run the following to start the virtual machines. Begin by adding security group rules for SSH and ICMP.

    openstack security group rule create default --protocol icmp
    openstack security group rule create default --protocol tcp --dst-port 22
  2. Start the virtual machine in the privateA subnet. Run openstack image list to find the image ID, and boot with the image ID instead of the image name. After executing this step, wait approximately 10 seconds to allow the virtual machine to become active.

    NETA=$(openstack network list | awk '/privateA/ {print $2}')
    openstack server create --flavor 1 --image <id> --nic net-id=$NETA vm1
  3. Start the virtual machine in the privateB subnet.

    NETB=$(openstack network list | awk '/privateB/ {print $2}')
    openstack server create --flavor 1 --image <id> --nic net-id=$NETB vm2
  4. Verify that private IPs are allocated to the respective VMs. Take note of the IPs for later use.

    openstack server show vm1
    openstack server show vm2
Procedure 10.2: Create VPN
  1. You can set up the VPN by executing the commands below from the lifecycle manager or any shell with access to the service APIs. Begin by creating the policies with vpn-ikepolicy-create and vpn-ipsecpolicy-create.

    neutron vpn-ikepolicy-create ikepolicy
    neutron vpn-ipsecpolicy-create ipsecpolicy
  2. Create the VPN service at router1.

    neutron vpn-service-create --name myvpnA --description "My vpn service" router1 subA
  3. Wait at least 5 seconds and then run ipsec-site-connection-create to create an ipsec-site connection. Note that --peer-address is the assigned ext-net IP from router2 and --peer-cidr is the subB CIDR.

    neutron ipsec-site-connection-create --name vpnconnection1 --vpnservice-id myvpnA \
    --ikepolicy-id ikepolicy --ipsecpolicy-id ipsecpolicy --peer-address 172.16.0.3 \
    --peer-id 172.16.0.3 --peer-cidr 10.2.0.0/24 --psk secret
  4. Create the VPN service at router2.

    neutron vpn-service-create --name myvpnB --description "My vpn serviceB" router2 subB
  5. Wait at least 5 seconds and then run ipsec-site-connection-create to create an ipsec-site connection. Note that --peer-address is the assigned ext-net IP from router1 and --peer-cidr is the subA CIDR.

    neutron ipsec-site-connection-create --name vpnconnection2 --vpnservice-id myvpnB \
    --ikepolicy-id ikepolicy --ipsecpolicy-id ipsecpolicy --peer-address 172.16.0.2 \
    --peer-id 172.16.0.2 --peer-cidr 10.1.0.0/24 --psk secret
  6. On the Cloud Lifecycle Manager, run the ipsec-site-connection-list command to see the active connections. Also check that the VPN services are ACTIVE by running vpn-service-list, and then check the status of the ipsec-site-connections. It can take 1 to 3 minutes for both the vpn-services and the ipsec-site-connections to become ACTIVE.

    neutron ipsec-site-connection-list
    +--------------------------------------+----------------+--------------+---------------+------------+-----------+--------+
    | id                                   | name           | peer_address | peer_cidrs    | route_mode | auth_mode | status |
    +--------------------------------------+----------------+--------------+---------------+------------+-----------+--------+
    | 1e8763e3-fc6a-444c-a00e-426a4e5b737c | vpnconnection2 | 172.16.0.2   | "10.1.0.0/24" | static     | psk       | ACTIVE |
    | 4a97118e-6d1d-4d8c-b449-b63b41e1eb23 | vpnconnection1 | 172.16.0.3   | "10.2.0.0/24" | static     | psk       | ACTIVE |
    +--------------------------------------+----------------+--------------+---------------+------------+-----------+--------+
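
Because the connections can take a few minutes to become ACTIVE, you may want to check the status column from a script. The following minimal Python sketch parses the ASCII table that the CLI prints. The parse_table and not_active helpers are hypothetical, the sample table is abridged from the output above, and the DOWN status is only there for illustration.

```python
def parse_table(text):
    """Turn the ASCII table printed by the CLI into a list of dicts."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first | row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def not_active(rows):
    """Names of rows whose status column is anything other than ACTIVE."""
    return [row["name"] for row in rows if row["status"] != "ACTIVE"]


# Abridged, illustrative table in the same layout as the output above.
SAMPLE = """
+----------------+--------------+--------+
| name           | peer_address | status |
+----------------+--------------+--------+
| vpnconnection2 | 172.16.0.2   | ACTIVE |
| vpnconnection1 | 172.16.0.3   | DOWN   |
+----------------+--------------+--------+
"""
print(not_active(parse_table(SAMPLE)))
```

In practice you would pass in the captured output of neutron ipsec-site-connection-list and retry until the returned list is empty.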

Verify VPN

Non-admin users can verify the VPN connection by pinging the virtual machines.

  1. Check the VPN connections.

    Note

    vm1-ip and vm2-ip denote the private IPs of vm1 and vm2 respectively. The private IPs are obtained as described in Step 4 of Procedure 10.1, “Starting Virtual Machines”. If you are unable to SSH to the private network due to a lack of direct access, the VM console can be accessed through horizon.

    ssh cirros@vm1-ip
    password: <password>
    
    # ping the private IP address of vm2
    ping ###.###.###.###
  2. In another terminal, connect to vm2 and ping vm1.

    ssh cirros@vm2-ip
    password: <password>
    
    # ping the private IP address of vm1
    ping ###.###.###.###
  3. You should see ping responses from both virtual machines.

As the admin user, check that a route exists between the router gateways. Once the gateways have been checked, you can verify packet encryption with a traffic analyzer (tcpdump) by tapping the respective namespace (qrouter-* for non-DVR and snat-* for DVR) and the right interface (qg-***).

Note

When using DVR namespaces, all the occurrences of qrouter-xxxxxx in the following commands should be replaced with respective snat-xxxxxx.

  1. Check whether a route exists between the two router gateways. You can get the right qrouter namespace ID by executing sudo ip netns. Once you have the qrouter namespace ID, you can find the interface in the output of sudo ip netns exec qrouter-xxxxxxxx ip addr.

    sudo ip netns
    sudo ip netns exec qrouter-<router1 UUID> ping <router2 gateway>
    sudo ip netns exec qrouter-<router2 UUID> ping <router1 gateway>
  2. Initiate a tcpdump on the interface.

    sudo ip netns exec qrouter-xxxxxxxx tcpdump -i qg-xxxxxx
  3. Check the VPN connection.

    ssh cirros@vm1-ip
    password: <password>
    
    # ping the private IP address of vm2
    ping ###.###.###.###
  4. Repeat for the other namespace and its tap interface.

    sudo ip netns exec qrouter-xxxxxxxx tcpdump -i qg-xxxxxx
  5. In another terminal, connect to vm2 and ping vm1.

    ssh cirros@vm2-ip
    password: <password>
    
    # ping the private IP address of vm1
    ping ###.###.###.###
  6. You will find encrypted packets containing ‘ESP’ in the tcpdump trace.
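
If you save the tcpdump trace to a file, a few lines of Python can confirm that the traffic between the gateways is ESP rather than cleartext ICMP. This is a minimal sketch. The count_packets helper is hypothetical, and the sample lines are written by hand in tcpdump's usual one-line format rather than captured from a real system.

```python
def count_packets(lines):
    """Count ESP (encrypted) and cleartext ICMP packets in tcpdump lines."""
    esp = sum(1 for line in lines if ": ESP(" in line)
    icmp = sum(1 for line in lines if ": ICMP " in line)
    return esp, icmp


# Illustrative lines only, not taken from a real capture.
SAMPLE = [
    "10:15:01.000001 IP 172.16.0.2 > 172.16.0.3: ESP(spi=0xc1234567,seq=0x1), length 132",
    "10:15:01.000950 IP 172.16.0.3 > 172.16.0.2: ESP(spi=0xc7654321,seq=0x1), length 132",
]
esp, icmp = count_packets(SAMPLE)
print(esp, icmp)
```

A non-zero ESP count together with a zero ICMP count on the qg- interface suggests that the pings between vm1 and vm2 are being encrypted.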

10.2.4 More Information

VPNaaS currently only supports Pre-shared Keys (PSK) security between VPN gateways. A different VPN gateway solution should be considered if stronger, certificate-based security is required.

For more information on the neutron command-line interface (CLI) and VPN as a Service (VPNaaS), see the OpenStack networking command-line client reference: https://docs.openstack.org/python-openstackclient/rocky/

For information on how to create an external network and subnet, see the OpenStack manual: http://docs.openstack.org/user-guide/dashboard_create_networks.html

10.3 DNS Service Overview

SUSE OpenStack Cloud DNS service provides multi-tenant Domain Name Service with REST API management for domains and records.

Warning

The DNS Service is not intended to be used as an internal or private DNS service. The name records in DNSaaS should be treated as public information that anyone could query. There are controls to prevent tenants from creating records for domains they do not own. TSIG (Transaction SIGnature) is used to ensure integrity during zone transfers to other DNS servers.

10.3.1 For More Information

10.3.2 designate Initial Configuration

After the SUSE OpenStack Cloud installation has been completed, designate requires initial configuration to operate.

10.3.2.1 Identifying Name Server Public IPs

Depending on the back-end, the method used to identify the name servers' public IPs will differ.

10.3.2.1.1 InfoBlox

InfoBlox will act as your public name servers. Consult the InfoBlox management UI to identify the IPs.

10.3.2.1.2 BIND Back-end

You can find the name server IPs in /etc/hosts by looking for the ext-api addresses, which are the addresses of the controllers. For example:

192.168.10.1 example-cp1-c1-m1-extapi
192.168.10.2 example-cp1-c1-m2-extapi
192.168.10.3 example-cp1-c1-m3-extapi
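
To collect these addresses from a script, you can filter the hosts file for names ending in -extapi. The following is a minimal Python sketch. The extapi_addresses helper is a hypothetical name, and the sample text repeats the example entries above.

```python
def extapi_addresses(hosts_text):
    """Return the IP of every hosts entry that has a name ending in -extapi."""
    found = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) >= 2 and any(name.endswith("-extapi") for name in fields[1:]):
            found.append(fields[0])
    return found


SAMPLE = """
192.168.10.1 example-cp1-c1-m1-extapi
192.168.10.2 example-cp1-c1-m2-extapi
192.168.10.3 example-cp1-c1-m3-extapi
"""
print(extapi_addresses(SAMPLE))
```

On a controller you would read the text with open("/etc/hosts").read() instead of using the sample string.
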
10.3.2.1.3 Creating Name Server A Records

Each name server requires a public name, for example ns1.example.com., to which designate-managed domains will be delegated. There are two common locations where these names may be registered: within a zone hosted on designate itself, or within a zone hosted on an external DNS service.

If you are using an externally managed zone for these names:

  • For each name server public IP, create the necessary A records in the external system.

If you are using a designate-managed zone for these names:

  1. Create the zone in designate which will contain the records:

    ardana > openstack zone create --email hostmaster@example.com example.com.
    +----------------+--------------------------------------+
    | Field          | Value                                |
    +----------------+--------------------------------------+
    | action         | CREATE                               |
    | created_at     | 2016-03-09T13:16:41.000000           |
    | description    | None                                 |
    | email          | hostmaster@example.com               |
    | id             | 23501581-7e34-4b88-94f4-ad8cec1f4387 |
    | masters        |                                      |
    | name           | example.com.                         |
    | pool_id        | 794ccc2c-d751-44fe-b57f-8894c9f5c842 |
    | project_id     | a194d740818942a8bea6f3674e0a3d71     |
    | serial         | 1457529400                           |
    | status         | PENDING                              |
    | transferred_at | None                                 |
    | ttl            | 3600                                 |
    | type           | PRIMARY                              |
    | updated_at     | None                                 |
    | version        | 1                                    |
    +----------------+--------------------------------------+
  2. For each name server public IP, create an A record. For example:

    ardana > openstack recordset create --records 192.168.10.1 --type A example.com. ns1.example.com.
    +-------------+--------------------------------------+
    | Field       | Value                                |
    +-------------+--------------------------------------+
    | action      | CREATE                               |
    | created_at  | 2016-03-09T13:18:36.000000           |
    | description | None                                 |
    | id          | 09e962ed-6915-441a-a5a1-e8d93c3239b6 |
    | name        | ns1.example.com.                     |
    | records     | 192.168.10.1                         |
    | status      | PENDING                              |
    | ttl         | None                                 |
    | type        | A                                    |
    | updated_at  | None                                 |
    | version     | 1                                    |
    | zone_id     | 23501581-7e34-4b88-94f4-ad8cec1f4387 |
    +-------------+--------------------------------------+
  3. When records have been added, list the record sets in the zone to validate:

    ardana > openstack recordset list example.com.
    +--------------+------------------+------+---------------------------------------------------+
    | id           | name             | type | records                                           |
    +--------------+------------------+------+---------------------------------------------------+
    | 2d6cf...655b | example.com.     | SOA  | ns1.example.com. hostmaster.example.com 145...600 |
    | 33466...bd9c | example.com.     | NS   | ns1.example.com.                                  |
    | da98c...bc2f | example.com.     | NS   | ns2.example.com.                                  |
    | 672ee...74dd | example.com.     | NS   | ns3.example.com.                                  |
    | 09e96...39b6 | ns1.example.com. | A    | 192.168.10.1                                      |
    | bca4f...a752 | ns2.example.com. | A    | 192.168.10.2                                      |
    | 0f123...2117 | ns3.example.com. | A    | 192.168.10.3                                      |
    +--------------+------------------+------+---------------------------------------------------+
  4. Contact your domain registrar requesting Glue Records to be registered in the com. zone for the nameserver and public IP address pairs above. If you are using a sub-zone of an existing company zone (for example, ns1.cloud.mycompany.com.), the Glue must be placed in the mycompany.com. zone.
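
When there are several name servers, the recordset commands in step 2 differ only in the IP address and the name. The following minimal Python sketch prints one command per public IP in the same form as the example above. The recordset_commands helper and the ns1, ns2, ns3 naming scheme are assumptions for illustration.

```python
def recordset_commands(zone, ips):
    """Build one 'openstack recordset create' command per name server IP."""
    return [
        "openstack recordset create --records %s --type A %s ns%d.%s"
        % (ip, zone, index, zone)
        for index, ip in enumerate(ips, start=1)
    ]


for command in recordset_commands(
    "example.com.", ["192.168.10.1", "192.168.10.2", "192.168.10.3"]
):
    print(command)
```

Review the printed commands before running them, because the zone name must end with a trailing dot.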

10.3.2.1.4 For More Information

For additional DNS integration and configuration information, see the OpenStack designate documentation at https://docs.openstack.org/designate/rocky/.

For more information on creating servers, domains and examples, see the OpenStack REST API documentation at https://developer.openstack.org/api-ref/dns/.

10.3.3 DNS Service Monitoring Support

10.3.3.1 DNS Service Monitoring Support

Additional monitoring support for the DNS Service (designate) has been added to SUSE OpenStack Cloud.

In the Networking section of the Operations Console, you can see alarms for all of the DNS Services (designate), such as designate-zone-manager, designate-api, designate-pool-manager, designate-mdns, and designate-central after running designate-stop.yml.

You can run designate-start.yml to start the DNS Services back up and the alarms will change from a red status to green and be removed from the New Alarms panel of the Operations Console.

An example of the generated alarms from the Operations Console is provided below after running designate-stop.yml:

ALARM:  STATE:  ALARM ID:  LAST CHECK:  DIMENSION:
Process Check
0f221056-1b0e-4507-9a28-2e42561fac3e 2016-10-03T10:06:32.106Z hostname=ardana-cp1-c1-m1-mgmt,
service=dns,
cluster=cluster1,
process_name=designate-zone-manager,
component=designate-zone-manager,
control_plane=control-plane-1,
cloud_name=entry-scale-kvm

Process Check
50dc4c7b-6fae-416c-9388-6194d2cfc837 2016-10-03T10:04:32.086Z hostname=ardana-cp1-c1-m1-mgmt,
service=dns,
cluster=cluster1,
process_name=designate-api,
component=designate-api,
control_plane=control-plane-1,
cloud_name=entry-scale-kvm

Process Check
55cf49cd-1189-4d07-aaf4-09ed08463044 2016-10-03T10:05:32.109Z hostname=ardana-cp1-c1-m1-mgmt,
service=dns,
cluster=cluster1,
process_name=designate-pool-manager,
component=designate-pool-manager,
control_plane=control-plane-1,
cloud_name=entry-scale-kvm

Process Check
c4ab7a2e-19d7-4eb2-a9e9-26d3b14465ea 2016-10-03T10:06:32.105Z hostname=ardana-cp1-c1-m1-mgmt,
service=dns,
cluster=cluster1,
process_name=designate-mdns,
component=designate-mdns,
control_plane=control-plane-1,
cloud_name=entry-scale-kvm

HTTP Status
c6349bbf-4fd1-461a-9932-434169b86ce5 2016-10-03T10:05:01.731Z service=dns,
cluster=cluster1,
url=http://100.60.90.3:9001/,
hostname=ardana-cp1-c1-m3-mgmt,
component=designate-api,
control_plane=control-plane-1,
api_endpoint=internal,
cloud_name=entry-scale-kvm,
monitored_host_type=instance

Process Check
ec2c32c8-3b91-4656-be70-27ff0c271c89 2016-10-03T10:04:32.082Z hostname=ardana-cp1-c1-m1-mgmt,
service=dns,
cluster=cluster1,
process_name=designate-central,
component=designate-central,
control_plane=control-plane-1,
cloud_name=entry-scale-kvm
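
The DIMENSION column is a comma-separated list of key=value pairs, which makes it easy to filter alarms by process or host. The following minimal Python sketch turns one such value into a dictionary. The parse_dimensions helper is a hypothetical name, and the sample repeats the designate-api alarm above.

```python
def parse_dimensions(text):
    """Parse 'key=value, key=value' alarm dimensions into a dict."""
    dims = {}
    for item in text.replace("\n", "").split(","):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")  # split on the first = only
            dims[key] = value
    return dims


SAMPLE = """hostname=ardana-cp1-c1-m1-mgmt,
service=dns,
cluster=cluster1,
process_name=designate-api,
component=designate-api,
control_plane=control-plane-1,
cloud_name=entry-scale-kvm"""

dims = parse_dimensions(SAMPLE)
print(dims["hostname"], dims["process_name"])
```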

10.4 Networking Service Overview

SUSE OpenStack Cloud Networking is a virtual Networking service that leverages the OpenStack neutron service to provide network connectivity and addressing to SUSE OpenStack Cloud Compute service devices.

The Networking service also provides an API to configure and manage a variety of network services.

You can use the Networking service to connect guest servers or you can define and configure your own virtual network topology.

10.4.1 Installing the Networking Service

SUSE OpenStack Cloud Network Administrators are responsible for planning the neutron Networking service and, once it is installed, for configuring the service to meet the needs of their cloud network users.

10.4.2 Working with the Networking service

To perform tasks using the Networking service, you can use the dashboard, API or CLI.

10.4.3 Reconfiguring the Networking service

If you change any of the network configuration after installation, it is recommended that you reconfigure the Networking service by running the neutron-reconfigure playbook.

On the Cloud Lifecycle Manager:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

10.4.4 For more information

For information on how to operate your cloud we suggest you read the OpenStack Operations Guide. The Architecture section contains useful information about how an OpenStack Cloud is put together. However, SUSE OpenStack Cloud takes care of these details for you. The Operations section contains information on how to manage the system.

10.4.5 Neutron External Networks

10.4.5.1 External networks overview

This topic explains how to create a neutron external network.

External networks provide access to the internet.

The typical use is to provide an IP address that can be used to reach a VM from an external network which can be a public network like the internet or a network that is private to an organization.

10.4.5.2 Using the Ansible Playbook

This playbook will query the Networking service for an existing external network, and then create a new one if you do not already have one. The resulting external network will have the name ext-net with a subnet matching the CIDR you specify in the command below.

If you need more granularity, for example to specify an allocation pool for the subnet, use the procedure in Section 10.4.5.3, “Using the python-neutronclient CLI”.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-cloud-configure.yml -e EXT_NET_CIDR=<CIDR>

The table below shows the optional switch that you can use as part of this playbook to specify environment-specific information:

Switch | Description
-e EXT_NET_CIDR=<CIDR> | Optional. You can use this switch to specify the external network CIDR. If you choose not to use this switch, or use a wrong value, the VMs will not be accessible over the network. This CIDR will be from the EXTERNAL VM network.

10.4.5.3 Using the python-neutronclient CLI

For more granularity you can utilize the OpenStackClient tool to create your external network.

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the admin credentials:

    ardana > source ~/service.osrc
  3. Create the external network and then the subnet using these commands below.

    Creating the network:

    ardana > openstack network create --external <external-network-name>

    Creating the subnet:

    ardana > openstack subnet create --network EXTERNAL-NETWORK-NAME --subnet-range CIDR --gateway GATEWAY --allocation-pool start=IP_START,end=IP_END [--no-dhcp] SUBNET-NAME

    Where:

    Value | Description

    EXTERNAL-NETWORK-NAME

    This is the name given to your external network. This is a unique value that you will choose. The value ext-net is usually used.

    SUBNET-NAME

    This is the name given to the new subnet. This is a value that you will choose.

    --subnet-range CIDR

    Use this switch to specify the external network CIDR. If you do not use this switch or use a wrong value, the VMs will not be accessible over the network.

    This CIDR will be from the EXTERNAL VM network.

    --gateway

    Optional switch to specify the gateway IP for your subnet. If this is not included, it will choose the first available IP.

    --allocation-pool start=IP_START,end=IP_END

    Optional switch to specify start and end IP addresses to use as the allocation pool for this subnet.

    --no-dhcp

    Optional switch if you want to disable DHCP on this subnet. If this is not specified, DHCP will be enabled.
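
Before creating the subnet, you can check that the gateway and allocation pool fit the CIDR. The following minimal Python sketch does this with the standard ipaddress module. The check_subnet helper is hypothetical, the pool addresses are made-up values inside the 172.16.0.0/16 range, and "first available IP" is modelled as the first host address of the CIDR, which is an assumption.

```python
import ipaddress


def check_subnet(cidr, pool_start, pool_end, gateway=None):
    """Return the effective gateway and a list of problems for a subnet plan."""
    net = ipaddress.ip_network(cidr)
    # Assumption: without --gateway, the first host address is used.
    gw = ipaddress.ip_address(gateway) if gateway else net.network_address + 1
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    problems = []
    if gw not in net:
        problems.append("gateway is outside the CIDR")
    if start not in net or end not in net:
        problems.append("allocation pool is outside the CIDR")
    elif start > end:
        problems.append("allocation pool start is after its end")
    elif start <= gw <= end:
        problems.append("gateway falls inside the allocation pool")
    return str(gw), problems


print(check_subnet("172.16.0.0/16", "172.16.1.10", "172.16.1.200"))
```

The function returns the gateway it assumed and any problems it found, so an empty problem list means the values are at least self-consistent.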

10.4.5.4 Multiple External Networks

SUSE OpenStack Cloud provides the ability to have multiple external networks, by using the Network Service (neutron) provider networks for external networks. You can configure SUSE OpenStack Cloud to allow the use of provider VLANs as external networks by following these steps.

  1. Do NOT include the neutron.l3_agent.external_network_bridge tag in the network_groups definition for your cloud. This results in the l3_agent.ini external_network_bridge being set to an empty value (rather than the traditional br-ex).

  2. Configure your cloud to use provider VLANs by specifying the provider-physical-network attribute of the neutron.networks.vlan tag on one of the network groups defined for your cloud.

    For example, to run provider VLANs over the EXAMPLE network group (some attributes omitted for brevity):

    network-groups:
    
      - name: EXAMPLE
        tags:
          - neutron.networks.vlan:
              provider-physical-network: physnet1
  3. After the cloud has been deployed, you can create external networks using provider VLANs.

    For example, using the OpenStackClient:

    1. Create external network 1 on vlan101

      ardana > openstack network create --provider-network-type vlan \
      --provider-physical-network physnet1 --provider-segment 101 --external ext-net1
    2. Create external network 2 on vlan102

      ardana > openstack network create --provider-network-type vlan \
      --provider-physical-network physnet1 --provider-segment 102 --external ext-net2

10.4.6 Neutron Provider Networks

This topic explains how to create a neutron provider network.

A provider network is a virtual network created in the SUSE OpenStack Cloud cloud that is consumed by SUSE OpenStack Cloud services. The distinctive element of a provider network is that it does not create a virtual router; rather, it depends on L3 routing that is provided by the infrastructure.

A provider network is created by adding the specification to the SUSE OpenStack Cloud input model. It consists of at least one network and one or more subnets.

10.4.6.1 SUSE OpenStack Cloud input model

The input model is the primary mechanism a cloud admin uses in defining a SUSE OpenStack Cloud installation. It exists as a directory with a data subdirectory that contains YAML files. By convention, any service that creates a neutron provider network will create a subdirectory under the data directory and the name of the subdirectory shall be the project name. For example, the Octavia project will use neutron provider networks so it will have a subdirectory named 'octavia' and the config file that specifies the neutron network will exist in that subdirectory.

    ├── cloudConfig.yml
    ├── data
    │   ├── control_plane.yml
    │   ├── disks_compute.yml
    │   ├── disks_controller_1TB.yml
    │   ├── disks_controller.yml
    │   ├── firewall_rules.yml
    │   ├── net_interfaces.yml
    │   ├── network_groups.yml
    │   ├── networks.yml
    │   ├── neutron
    │   │   └── neutron_config.yml
    │   ├── nic_mappings.yml
    │   ├── server_groups.yml
    │   ├── server_roles.yml
    │   ├── servers.yml
    │   ├── swift
    │   │   └── swift_config.yml
    │   └── octavia
    │       └── octavia_config.yml
    ├── README.html
    └── README.md

10.4.6.2 Network/Subnet specification

The elements required in the input model for you to define a network are:

  • name

  • network_type

  • physical_network

Elements that are optional when defining a network are:

  • segmentation_id

  • shared

Required elements for the subnet definition are:

  • cidr

Optional elements for the subnet definition are:

  • allocation_pools which will require start and end addresses

  • host_routes which will require a destination and nexthop

  • gateway_ip

  • no_gateway

  • enable_dhcp

NOTE: Only IPv4 is supported at the present time.

10.4.6.3 Network details

The following table outlines the network values to be set, and what they represent.

Attribute | Required/optional | Allowed Values | Usage
name | Required | |
network_type | Required | flat, vlan, vxlan | The type of desired network
physical_network | Required | Valid | Name of the physical network that is overlaid with the virtual network
segmentation_id | Optional | vlan or vxlan ranges | VLAN ID for vlan or tunnel ID for vxlan
shared | Optional | True | Shared by all projects or private to a single project

10.4.6.4 Subnet details

The following table outlines the subnet values to be set, and what they represent.

Attribute | Req/Opt | Allowed Values | Usage
cidr | Required | Valid CIDR range | For example, 172.30.0.0/24
allocation_pools | Optional | See allocation_pools table below |
host_routes | Optional | See host_routes table below |
gateway_ip | Optional | Valid IP addr | Subnet gateway to other nets
no_gateway | Optional | True | No distribution of gateway
enable_dhcp | Optional | True | Enable DHCP for this subnet

10.4.6.5 ALLOCATION_POOLS details

The following table explains allocation pool settings.

Attribute | Req/Opt | Allowed Values | Usage
start | Required | Valid IP addr | First IP address in pool
end | Required | Valid IP addr | Last IP address in pool

10.4.6.6 HOST_ROUTES details

The following table explains host route settings.

Attribute | Req/Opt | Allowed Values | Usage
destination | Required | Valid CIDR | Destination subnet
nexthop | Required | Valid IP addr | Hop to take to destination subnet
Note

Multiple destination/nexthop values can be used.

10.4.6.7 Examples

The following examples show the configuration file settings for neutron and Octavia.

Octavia configuration

This file defines the mapping. It does not need to be edited unless you want to change the name of your VLAN.

Path: ~/openstack/my_cloud/definition/data/octavia/octavia_config.yml

---
  product:
    version: 2

  configuration-data:
    - name: OCTAVIA-CONFIG-CP1
      services:
        - octavia
      data:
        amp_network_name: OCTAVIA-MGMT-NET

neutron configuration

Input your network configuration information for your provider VLANs in neutron_config.yml found here:

~/openstack/my_cloud/definition/data/neutron/.

---
  product:
    version: 2

  configuration-data:
    - name:  NEUTRON-CONFIG-CP1
      services:
        - neutron
      data:
        neutron_provider_networks:
        - name: OCTAVIA-MGMT-NET
          provider:
            - network_type: vlan
              physical_network: physnet1
              segmentation_id: 2754
          cidr: 10.13.189.0/24
          no_gateway:  True
          enable_dhcp: True
          allocation_pools:
            - start: 10.13.189.4
              end: 10.13.189.252
          host_routes:
            # route to MANAGEMENT-NET
            - destination: 10.13.111.128/26
              nexthop:  10.13.189.5
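
The attribute tables above translate directly into a few checks that you can run on an entry before committing it. The following minimal Python sketch validates one neutron_provider_networks entry that has already been loaded into a dictionary. The check_provider_network helper is hypothetical, and the rule that a nexthop must lie inside the subnet is an assumption of this sketch, not something the configuration processor is known to enforce.

```python
import ipaddress

ALLOWED_TYPES = ("flat", "vlan", "vxlan")


def check_provider_network(net):
    """Return a list of problems found in one neutron_provider_networks entry."""
    problems = []
    if not net.get("name"):
        problems.append("missing name")
    for provider in net.get("provider", []):
        if provider.get("network_type") not in ALLOWED_TYPES:
            problems.append("network_type must be flat, vlan or vxlan")
        if not provider.get("physical_network"):
            problems.append("missing physical_network")
    if "cidr" not in net:
        return problems + ["missing cidr"]
    cidr = ipaddress.ip_network(net["cidr"])
    for pool in net.get("allocation_pools", []):
        start = ipaddress.ip_address(pool["start"])
        end = ipaddress.ip_address(pool["end"])
        if start not in cidr or end not in cidr or start > end:
            problems.append("bad allocation pool %s-%s" % (start, end))
    for route in net.get("host_routes", []):
        ipaddress.ip_network(route["destination"])  # raises ValueError if malformed
        # Assumption: a usable nexthop lies inside the subnet itself.
        if ipaddress.ip_address(route["nexthop"]) not in cidr:
            problems.append("nexthop %s is outside %s" % (route["nexthop"], cidr))
    return problems


# The OCTAVIA-MGMT-NET entry from the example above.
octavia_net = {
    "name": "OCTAVIA-MGMT-NET",
    "provider": [
        {"network_type": "vlan", "physical_network": "physnet1", "segmentation_id": 2754}
    ],
    "cidr": "10.13.189.0/24",
    "no_gateway": True,
    "enable_dhcp": True,
    "allocation_pools": [{"start": "10.13.189.4", "end": "10.13.189.252"}],
    "host_routes": [{"destination": "10.13.111.128/26", "nexthop": "10.13.189.5"}],
}
print(check_provider_network(octavia_net))
```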

10.4.6.8 Implementing your changes

  1. Commit the changes to git:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -a -m "configuring provider network"
  2. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  3. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. If you are performing a clean cloud installation, continue with the installation.

  5. If you are only adding a neutron Provider network to an existing model, then run the neutron-deploy.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-deploy.yml

10.4.6.9 Multiple Provider Networks

The physical network infrastructure must be configured to convey the provider VLAN traffic as tagged VLANs to the cloud compute nodes and network service network nodes. Configuration of the physical network infrastructure is outside the scope of the SUSE OpenStack Cloud 9 software.

SUSE OpenStack Cloud 9 automates the server networking configuration and the Network Service configuration based on information in the cloud definition. To configure the system for provider VLANs, specify the neutron.networks.vlan tag with a provider-physical-network attribute on one or more network groups. For example (some attributes omitted for brevity):

network-groups:

  - name: NET_GROUP_A
    tags:
      - neutron.networks.vlan:
          provider-physical-network: physnet1

  - name: NET_GROUP_B
    tags:
      - neutron.networks.vlan:
          provider-physical-network: physnet2

A network group is associated with a server network interface via an interface model. For example (some attributes omitted for brevity):

interface-models:
  - name: INTERFACE_SET_X
    network-interfaces:
      - device:
          name: bond0
        network-groups:
          - NET_GROUP_A
      - device:
          name: eth3
        network-groups:
          - NET_GROUP_B

A network group used for provider VLANs may contain only a single SUSE OpenStack Cloud network, because that VLAN must span all compute nodes and any Network Service network nodes/controllers (that is, it is a single L2 segment). The SUSE OpenStack Cloud network must be defined with tagged-vlan false, otherwise a Linux VLAN network interface will be created. For example:

networks:

  - name: NET_A
    tagged-vlan: false
    network-group: NET_GROUP_A

  - name: NET_B
    tagged-vlan: false
    network-group: NET_GROUP_B

When the cloud is deployed, SUSE OpenStack Cloud 9 will create the appropriate bridges on the servers, and set the appropriate attributes in the neutron configuration files (for example, bridge_mappings).

After the cloud has been deployed, create Network Service network objects for each provider VLAN. For example, using the Network Service CLI:

ardana > openstack network create --provider-network-type vlan --provider-physical-network physnet1 --provider-segment 101 mynet101
ardana > openstack network create --provider-network-type vlan --provider-physical-network physnet2 --provider-segment 234 mynet234
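
When many provider VLANs have to be created, the commands can be generated from a list. The following Python sketch only builds and prints the command lines. It assumes an OpenStack client that accepts the --provider-network-type, --provider-physical-network and --provider-segment options, and the helper name network_create_command is hypothetical.

```python
import shlex

# (physical network, VLAN ID, network name) taken from the example above.
provider_vlans = [
    ("physnet1", 101, "mynet101"),
    ("physnet2", 234, "mynet234"),
]


def network_create_command(physnet, vlan_id, name):
    """Build the argument list for one provider VLAN network (hypothetical helper)."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID out of range: %d" % vlan_id)
    return [
        "openstack", "network", "create",
        "--provider-network-type", "vlan",
        "--provider-physical-network", physnet,
        "--provider-segment", str(vlan_id),
        name,
    ]


for physnet, vlan_id, name in provider_vlans:
    # Print only; the argument list could be passed to subprocess.run() to execute it.
    args = network_create_command(physnet, vlan_id, name)
    print(" ".join(shlex.quote(arg) for arg in args))
```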

10.4.6.10 More Information

For more information on the Network Service command-line interface (CLI), see the OpenStack networking command-line client reference: http://docs.openstack.org/cli-reference/content/neutronclient_commands.html

10.4.7 Using IPAM Drivers in the Networking Service

This topic describes how to choose and implement an IPAM driver.

10.4.7.1 Selecting and implementing an IPAM driver

Beginning with the Liberty release, OpenStack networking includes a pluggable interface for the IP Address Management (IPAM) function. This interface creates a driver framework for the allocation and de-allocation of subnets and IP addresses, enabling the integration of alternate IPAM implementations or third-party IP Address Management systems.

There are three possible IPAM driver options:

  • Non-pluggable driver. This option is the default when the ipam_driver parameter is not specified in neutron.conf.

  • Pluggable reference IPAM driver. The pluggable IPAM driver interface was introduced in SUSE OpenStack Cloud 9 (OpenStack Liberty). It is a refactoring of the Kilo non-pluggable driver to use the new pluggable interface. The setting in neutron.conf to specify this driver is ipam_driver = internal.

  • Pluggable Infoblox IPAM driver. The pluggable Infoblox IPAM driver is a third-party implementation of the pluggable IPAM interface. The corresponding setting in neutron.conf to specify this driver is ipam_driver = networking_infoblox.ipam.driver.InfobloxPool.

    Note

    You can use either the non-pluggable IPAM driver or a pluggable one. However, you cannot use both.
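
The mapping between the ipam_driver value and the three options can be illustrated with a short Python sketch. It reads a small sample of rendered neutron.conf content, not the neutron.conf.j2 template itself, and the function name describe_ipam_driver is hypothetical. A real neutron.conf can repeat options, which the strict parser used here would reject, so treat this as an illustration only.

```python
import configparser

# Sample of rendered neutron.conf content, for illustration only.
SAMPLE_NEUTRON_CONF = """
[DEFAULT]
l3_ha_net_cidr = 169.254.192.0/18
ipam_driver = internal
"""

DRIVER_NAMES = {
    None: "non-pluggable driver (ipam_driver not set)",
    "internal": "pluggable reference IPAM driver",
    "networking_infoblox.ipam.driver.InfobloxPool": "pluggable Infoblox IPAM driver",
}


def describe_ipam_driver(conf_text):
    """Name the IPAM driver option selected by the [DEFAULT] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    value = parser.defaults().get("ipam_driver")
    return DRIVER_NAMES.get(value, "unrecognized driver: %s" % value)


print(describe_ipam_driver(SAMPLE_NEUTRON_CONF))
```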

10.4.7.2 Using the Pluggable reference IPAM driver

To use the pluggable reference IPAM driver, the only parameter needed is ipam_driver. To set it, find the commented line ipam_driver = internal in the neutron.conf.j2 template, uncomment it, and commit the file. After you follow the standard steps to deploy neutron, neutron will be configured to run using the pluggable reference IPAM driver.

As stated, the file you must edit is neutron.conf.j2 on the Cloud Lifecycle Manager in the directory ~/openstack/my_cloud/config/neutron. Here is the relevant section where you can see the ipam_driver parameter commented out:

[DEFAULT]
  ...
  l3_ha_net_cidr = 169.254.192.0/18

  # Uncomment the line below if the Reference Pluggable IPAM driver is to be used
  # ipam_driver = internal
  ...

After uncommenting the line ipam_driver = internal, commit the file using git commit from the ~/openstack/my_cloud directory:

ardana > git commit -a -m 'My config for enabling the internal IPAM Driver'

Then follow the steps to deploy SUSE OpenStack Cloud in the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 13 “Overview” appropriate to your cloud configuration.

Note

Currently there is no migration path from the non-pluggable driver to a pluggable IPAM driver because changes are needed to database tables and neutron currently cannot make those changes.

10.4.7.3 Using the Infoblox IPAM driver

As suggested above, using the Infoblox IPAM driver requires changes to existing parameters in nova.conf and neutron.conf. To use the Infoblox appliance for IPAM, both the infoblox-ipam-agent and the Infoblox IPAM driver are required. Add the infoblox-ipam-agent service component to the cluster that contains the neutron API server (neutron-server), so that the agent is deployed on the same node where neutron-server is running. Usually this is a controller node.

  1. Have the Infoblox appliance running on the management network (the Infoblox appliance admin or the datacenter administrator should know how to perform this step).

  2. Change the control plane definition to add infoblox-ipam-agent as a service component in the controller node cluster (the added line is infoblox-ipam-agent, directly below neutron-server). Make the changes in control_plane.yml found here: ~/openstack/my_cloud/definition/data/control_plane.yml

    ---
      product:
        version: 2
    
      control-planes:
        - name: ccp
          control-plane-prefix: ccp
     ...
          clusters:
            - name: cluster0
              cluster-prefix: c0
              server-role: ARDANA-ROLE
              member-count: 1
              allocation-policy: strict
              service-components:
                - lifecycle-manager
            - name: cluster1
              cluster-prefix: c1
              server-role: CONTROLLER-ROLE
              member-count: 3
              allocation-policy: strict
              service-components:
                - ntp-server
    ...
                - neutron-server
                - infoblox-ipam-agent
    ...
                - designate-client
                - bind
          resources:
            - name: compute
              resource-prefix: comp
              server-role: COMPUTE-ROLE
              allocation-policy: any
  3. Modify the ~/openstack/my_cloud/config/neutron/neutron.conf.j2 file on the Cloud Lifecycle Manager, commenting and uncommenting the lines noted below to enable use with the Infoblox appliance:

    [DEFAULT]
                ...
                l3_ha_net_cidr = 169.254.192.0/18
    
    
                # Uncomment the line below if the Reference Pluggable IPAM driver is to be used
                # ipam_driver = internal
    
    
                # Comment out the line below if the Infoblox IPAM Driver is to be used
                # notification_driver = messaging
    
                # Uncomment the lines below if the Infoblox IPAM driver is to be used
                ipam_driver = networking_infoblox.ipam.driver.InfobloxPool
                notification_driver = messagingv2
    
    
                # Modify the infoblox sections below to suit your cloud environment
    
                [infoblox]
                cloud_data_center_id = 1
                # The name of this section is formed by "infoblox-dc:<infoblox.cloud_data_center_id>"
                # If cloud_data_center_id is 1, then the section name is "infoblox-dc:1"
    
                [infoblox-dc:1]
                http_request_timeout = 120
                http_pool_maxsize = 100
                http_pool_connections = 100
                ssl_verify = False
                wapi_version = 2.2
                admin_user_name = admin
                admin_password = infoblox
                grid_master_name = infoblox.localdomain
                grid_master_host = 1.2.3.4
    
    
                [QUOTAS]
                ...
  4. Change nova.conf.j2 to replace the notification driver "messaging" with "messagingv2":

     ...
    
     # Oslo messaging
     notification_driver = log
    
     #  Note:
     #  If the infoblox-ipam-agent is to be deployed in the cloud, change the
     #  notification_driver setting from "messaging" to "messagingv2".
     notification_driver = messagingv2
     notification_topics = notifications
    
     # Policy
     ...
  5. Commit the changes:

    ardana > cd ~/openstack/my_cloud
    ardana > git commit -a -m 'My config for enabling the Infoblox IPAM driver'
  6. Deploy the cloud with the changes. Due to changes to the control_plane.yml, you will need to rerun the config-processor-run.yml playbook if you have run it already during the install process.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml
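
The section naming rule from step 3 (the per-grid section is named infoblox-dc:<cloud_data_center_id>) is easy to get wrong. The following Python sketch checks it against a small sample of rendered configuration. The sample values and the function name missing_infoblox_section are assumptions made for illustration.

```python
import configparser

# Sample of rendered neutron.conf content, for illustration only.
SAMPLE = """
[DEFAULT]
ipam_driver = networking_infoblox.ipam.driver.InfobloxPool
notification_driver = messagingv2

[infoblox]
cloud_data_center_id = 1

[infoblox-dc:1]
grid_master_host = 1.2.3.4
wapi_version = 2.2
"""


def missing_infoblox_section(conf_text):
    """Return the expected infoblox-dc:<id> section name if it is absent, else None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    dc_id = parser.get("infoblox", "cloud_data_center_id")
    expected = "infoblox-dc:%s" % dc_id
    return None if parser.has_section(expected) else expected


print(missing_infoblox_section(SAMPLE))
```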

10.4.7.4 Configuration parameters for using the Infoblox IPAM driver

Changes required in the notification parameters in nova.conf:

| Parameter Name | Section in nova.conf | Default Value | Current Value | Description |
|---|---|---|---|---|
| notify_on_state_change | DEFAULT | None | vm_and_task_state | Send compute.instance.update notifications on instance state changes. vm_and_task_state means notify on VM and task state changes. Infoblox requires the value to be vm_state (notify on VM state change). Because vm_and_task_state already includes VM state changes, no change is needed for Infoblox. |
| notification_topics | DEFAULT | empty list | notifications | No change is needed for Infoblox. The Infoblox installation guide requires the notification topic to be "notifications". |
| notification_driver | DEFAULT | None | messaging | Change needed. The Infoblox installation guide requires the notification driver to be "messagingv2". |

Changes to existing parameters in neutron.conf

| Parameter Name | Section in neutron.conf | Default Value | Current Value | Description |
|---|---|---|---|---|
| ipam_driver | DEFAULT | None | None (parameter is undeclared in neutron.conf) | Pluggable IPAM driver to be used by the neutron API server. For Infoblox, the value is "networking_infoblox.ipam.driver.InfobloxPool". |
| notification_driver | DEFAULT | empty list | messaging | The driver used to send notifications from the neutron API server to the neutron agents. The installation guide for networking-infoblox calls for the notification_driver to be "messagingv2". |
| notification_topics | DEFAULT | None | notifications | No change needed. This row is listed to show the neutron parameters described in the installation guide for networking-infoblox. |

Parameters specific to the Networking Infoblox Driver. All the parameters for the Infoblox IPAM driver must be defined in neutron.conf.

| Parameter Name | Section in neutron.conf | Default Value | Description |
|---|---|---|---|
| cloud_data_center_id | infoblox | 0 | ID for selecting a particular grid from one or more grids to serve networks in the Infoblox back end |
| ipam_agent_workers | infoblox | 1 | Number of Infoblox IPAM agent workers to run |
| grid_master_host | infoblox-dc:<cloud_data_center_id> | empty string | IP address of the grid master. WAPI requests are sent to the grid_master_host |
| ssl_verify | infoblox-dc:<cloud_data_center_id> | False | Whether WAPI requests sent over HTTPS require SSL verification |
| wapi_version | infoblox-dc:<cloud_data_center_id> | 1.4 | The WAPI version. Value should be 2.2. |
| admin_user_name | infoblox-dc:<cloud_data_center_id> | empty string | Admin user name to access the grid master or cloud platform appliance |
| admin_password | infoblox-dc:<cloud_data_center_id> | empty string | Admin user password |
| http_pool_connections | infoblox-dc:<cloud_data_center_id> | 100 | |
| http_pool_maxsize | infoblox-dc:<cloud_data_center_id> | 100 | |
| http_request_timeout | infoblox-dc:<cloud_data_center_id> | 120 | |

The diagram below shows nova compute sending notifications to the infoblox-ipam-agent.

[Figure: nova compute notifications to the infoblox-ipam-agent]

10.4.7.5 Limitations

  • There is no IPAM migration path from the non-pluggable driver to a pluggable IPAM driver (https://bugs.launchpad.net/neutron/+bug/1516156). This means that the neutron database cannot be reconfigured to use a pluggable IPAM driver after installation. Unless you change the default non-pluggable IPAM configuration to a pluggable driver at install time, you will have no other opportunity to make that change: reconfiguring SUSE OpenStack Cloud 9 from the default non-pluggable IPAM configuration to a pluggable IPAM driver is not supported.

  • Upgrade from previous versions of SUSE OpenStack Cloud to SUSE OpenStack Cloud 9 to use a pluggable IPAM driver is not supported.

  • The Infoblox appliance does not allow for overlapping IPs. For example, only one tenant can have a CIDR of 10.0.0.0/24.

  • The Infoblox IPAM driver fails to create a subnet when no gateway IP is supplied. For example, the command openstack subnet create ... --no-gateway ... will fail.
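
Because the Infoblox appliance does not allow overlapping IP ranges, it can be useful to check planned subnet CIDRs against each other before creating them. The following Python sketch uses only the standard library. The CIDR list is made up for illustration and the function name overlapping_cidrs is hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return all pairs of CIDRs that overlap each other."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical subnets requested by different projects.
requested = ["10.0.0.0/24", "10.0.1.0/24", "10.0.0.128/25"]
print(overlapping_cidrs(requested))
```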

10.4.8 Configuring Load Balancing as a Service (LBaaS)

SUSE OpenStack Cloud 9 LBaaS Configuration

Load Balancing as a Service (LBaaS) is an advanced networking service that allows load balancing of multi-node environments. It provides the ability to spread requests across multiple servers thereby reducing the load on any single server. This document describes the installation steps and the configuration for LBaaS v2.

Warning

The LBaaS architecture is based on a driver model to support different load balancers. LBaaS-compatible drivers are provided by load balancer vendors including F5 and Citrix. A software load balancer driver called "Octavia" was introduced in the OpenStack Liberty release. The Octavia driver deploys a software load balancer called HAProxy. Octavia is the default load balancing provider in SUSE OpenStack Cloud 9 for LBaaS v2. Until Octavia is configured, the creation of load balancers will fail with an error. Refer to Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service” for information on installing Octavia.

Warning

Before upgrading to SUSE OpenStack Cloud 9, contact F5 and SUSE to determine which F5 drivers have been certified for use with SUSE OpenStack Cloud. Loading drivers not certified by SUSE may result in failure of your cloud deployment.

LBaaS v2 with Octavia offers a software load balancing solution that supports both a highly available control plane and data plane (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”). However, if an external hardware load balancer is selected, the cloud operator can achieve additional performance and availability.

LBaaS v2

Reasons to select this version.

  1. Your vendor already has a driver that supports LBaaS v2. Many hardware load balancer vendors already support LBaaS v2 and this list is growing all the time.

  2. You intend to script your load balancer creation and management so a UI is not important right now (horizon support will be added in a future release).

  3. You intend to support TLS termination at the load balancer.

  4. You intend to use the Octavia software load balancer (adding HA and scalability).

  5. You do not want to take your load balancers offline to perform subsequent LBaaS upgrades.

  6. You expect to need L7 load balancing in future releases.

Reasons not to select this version.

  1. Your LBaaS vendor does not have a v2 driver.

  2. You must be able to manage your load balancers from horizon.

  3. You have legacy software which utilizes the LBaaS v1 API.

LBaaS v2 is installed by default with SUSE OpenStack Cloud and requires minimal configuration to start the service.

Note

The LBaaS v2 API currently supports automatic failover of a deployed load balancer with Octavia. More information about this driver can be found in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”.

10.4.8.1 Prerequisites

SUSE OpenStack Cloud LBaaS v2

  1. SUSE OpenStack Cloud must be installed for LBaaS v2.

  2. Follow the instructions in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service” to install Octavia.

10.4.9 Load Balancer: Octavia Driver Administration

This document provides the instructions on how to enable and manage various components of the Load Balancer Octavia driver if that driver is enabled.

10.4.9.1 Monasca Alerts

The monasca-agent has the following Octavia-related plugins:

  • Process checks – checks if Octavia processes are running. When the agent starts, it detects which processes are running and then monitors them.

  • http_connect check – checks if it can connect to the Octavia API servers.

Alerts are displayed in the Operations Console.

10.4.9.2 Tuning Octavia Installation

Homogeneous Compute Configuration

Octavia works only with homogeneous compute node configurations. Currently, Octavia does not support multiple nova flavors. If Octavia needs to be supported on multiple compute nodes, then all the compute nodes should carry the same set of physnets (which will be used for Octavia).

Octavia and Floating IPs

Due to a neutron limitation, Octavia will only work with Centralized Virtual Router (CVR) routers. Another option is to use VLAN provider networks, which do not require a router.

You cannot currently assign a floating IP address as the VIP (user facing) address for a load balancer created by the Octavia driver if the underlying neutron network is configured to support Distributed Virtual Router (DVR). The Octavia driver uses a neutron function known as allowed address pairs to support load balancer failover.

A neutron bug currently prevents this function from working in a DVR configuration.

Octavia Configuration Files

The system comes pre-tuned and should not need any adjustments for most customers. If in rare instances manual tuning is needed, follow these steps:

Warning

Changes might be lost during SUSE OpenStack Cloud upgrades.

Edit the Octavia configuration files in ~/openstack/my_cloud/config/octavia. It is recommended that any changes be made in all of the Octavia configuration files.

  • octavia-api.conf.j2

  • octavia-health-manager.conf.j2

  • octavia-housekeeping.conf.j2

  • octavia-worker.conf.j2

After the changes are made to the configuration files, redeploy the service.

  1. Commit changes to git.

    ardana > cd ~/openstack
    ardana > git add -A
    ardana > git commit -m "My Octavia Config"
  2. Run the configuration processor and ready deployment.

    ardana > cd ~/openstack/ardana/ansible/
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Run the Octavia reconfigure.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml

Spare Pools

The Octavia driver provides support for creating spare pools of the HAProxy software installed in VMs. This means that a request to create a new load balancer pulls a load balancer from the spare pool instead of building a new one from scratch. Because the spare pools feature consumes resources, the number of load balancers in the spare pool is set to 0 by default, which disables the feature.

Reasons to enable a load balancing spare pool in SUSE OpenStack Cloud

  1. You expect a large number of load balancers to be provisioned all at once (puppet scripts, or ansible scripts) and you want them to come up quickly.

  2. You want to reduce the wait time a customer has while requesting a new load balancer.

To increase the number of load balancers in your spares pool, edit the Octavia configuration files by uncommenting the spare_amphora_pool_size and adding the number of load balancers you would like to include in your spares pool.

# Pool size for the spare pool
# spare_amphora_pool_size = 0
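
Because the same change has to be made in several configuration files, the edit can be scripted. The following Python sketch uncomments the setting and assigns a value in a piece of text. It works on a sample string, and the function name set_spare_pool_size is hypothetical. Reading and writing the actual template files is left out.

```python
import re


def set_spare_pool_size(conf_text, size):
    """Uncomment spare_amphora_pool_size in conf_text and set it to size."""
    if size < 0:
        raise ValueError("pool size must not be negative")
    pattern = re.compile(r"^#?[ \t]*spare_amphora_pool_size[ \t]*=.*$", re.MULTILINE)
    new_text, count = pattern.subn("spare_amphora_pool_size = %d" % size, conf_text)
    if count != 1:
        raise ValueError("expected exactly one spare_amphora_pool_size line, found %d" % count)
    return new_text


sample = "# Pool size for the spare pool\n# spare_amphora_pool_size = 0\n"
print(set_spare_pool_size(sample, 2))
```

After changing the files, commit the changes and run the reconfigure steps described above.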

10.4.9.3 Managing Amphora

Octavia starts a separate VM for each load balancing function. These VMs are called amphora.

Updating the Cryptographic Certificates

Octavia uses two-way SSL encryption for communication between amphora and the control plane. Octavia keeps track of the certificates on the amphora and will automatically recycle them. The certificates on the control plane are valid for one year after installation of SUSE OpenStack Cloud.

You can check the status of the certificate by logging in to the controller node as root and running:

ardana > cd /opt/stack/service/octavia-SOME UUID/etc/certs/
ardana > openssl x509 -in client.pem -text -noout

This prints the certificate, including its validity period (Not Before and Not After), where you can check the expiration date.

To renew the certificates, reconfigure Octavia. Reconfiguring causes Octavia to automatically generate new certificates and deploy them to the controller hosts.

On the Cloud Lifecycle Manager execute octavia-reconfigure:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml
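
To plan the renewal ahead of time, the remaining validity can be computed from the certificate end date. The following Python sketch assumes that the end date line was obtained separately with openssl x509 -in client.pem -enddate -noout, which prints a line starting with notAfter=. The sample line, the reference time, and the function name days_until_expiry are made up for illustration.

```python
from datetime import datetime


def days_until_expiry(enddate_line, now):
    """Parse a 'notAfter=...' line and return the whole days left until expiry."""
    value = enddate_line.strip().split("=", 1)[1]
    # openssl prints the time zone as a trailing "GMT"; drop it before parsing.
    if value.endswith(" GMT"):
        value = value[:-4]
    not_after = datetime.strptime(value, "%b %d %H:%M:%S %Y")
    return (not_after - now).days


# Hypothetical output line and reference time, for illustration only.
line = "notAfter=Nov  2 12:00:00 2023 GMT"
print(days_until_expiry(line, datetime(2023, 10, 3, 12, 0, 0)))
```

The reference time is passed in explicitly so that the calculation can be repeated for any planning date. Both times are treated as UTC.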

Accessing VM information in nova

You can use openstack project list as an administrative user to obtain information about the tenant or project-id of the Octavia project. In the example below, the Octavia project has a project-id of 37fd6e4feac14741b6e75aba14aea833.

ardana > openstack project list
+----------------------------------+------------------+
| ID                               | Name             |
+----------------------------------+------------------+
| 055071d8f25d450ea0b981ca67f7ccee | glance-swift     |
| 37fd6e4feac14741b6e75aba14aea833 | octavia          |
| 4b431ae087ef4bd285bc887da6405b12 | swift-monitor    |
| 8ecf2bb5754646ae97989ba6cba08607 | swift-dispersion |
| b6bd581f8d9a48e18c86008301d40b26 | services         |
| bfcada17189e4bc7b22a9072d663b52d | cinderinternal   |
| c410223059354dd19964063ef7d63eca | monitor          |
| d43bc229f513494189422d88709b7b73 | admin            |
| d5a80541ba324c54aeae58ac3de95f77 | demo             |
| ea6e039d973e4a58bbe42ee08eaf6a7a | backup           |
+----------------------------------+------------------+

You can then use openstack server list --tenant <project-id> to list the VMs for the Octavia tenant. Take particular note of the IP address on the OCTAVIA-MGMT-NET; in the example below it is 172.30.1.11. For additional nova command-line options see Section 10.4.9.5, “For More Information”.

ardana > openstack server list --tenant 37fd6e4feac14741b6e75aba14aea833
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| ID                                   | Name                                         | Tenant ID                        | Status | Task State | Power State | Networks                                       |
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| 1ed8f651-de31-4208-81c5-817363818596 | amphora-1c3a4598-5489-48ea-8b9c-60c821269e4c | 37fd6e4feac14741b6e75aba14aea833 | ACTIVE | -          | Running     | private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.11 |
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
Important

The Amphora VMs do not have SSH or any other access. In the rare case that there is a problem with the underlying load balancer the whole amphora will need to be replaced.

Initiating Failover of an Amphora VM

Under normal operations, Octavia constantly monitors the health of the amphora and automatically fails them over if there are any issues. This helps to minimize any potential downtime for load balancer users. There are, however, a few cases in which a failover needs to be initiated manually:

  1. The load balancer has become unresponsive and Octavia has not detected an error.

  2. A new image has become available and existing load balancers need to start using the new image.

  3. The cryptographic certificates to control and/or the HMAC password to verify Health information of the amphora have been compromised.

To minimize the impact on end users, the existing load balancer keeps working until shortly before the new one has been provisioned. There will still be a short interruption of the load balancing service, so keep that in mind when scheduling failovers. To initiate a failover, follow these steps (using the management IP from the previous example):

  1. Assign the IP to a shell variable for better readability.

    ardana > export MGM_IP=172.30.1.11
  2. Identify the port of the VM on the management network.

    ardana > openstack port list | grep $MGM_IP
    | 0b0301b9-4ee8-4fb6-a47c-2690594173f4 |                                                   | fa:16:3e:d7:50:92 | {"subnet_id": "3e0de487-e255-4fc3-84b8-60e08564c5b7", "ip_address": "172.30.1.11"} |
  3. Disable the port to initiate a failover. Note the load balancer will still function but cannot be controlled any longer by Octavia.

    Note

    Changes after disabling the port will result in errors.

    ardana > openstack port set --disable 0b0301b9-4ee8-4fb6-a47c-2690594173f4
  4. You can check to see if the amphora failed over with openstack server list --tenant <project-id>. This may take some time and in some cases may need to be repeated several times. You can tell that the failover has been successful by the changed IP on the management network.

    ardana > openstack server list --tenant 37fd6e4feac14741b6e75aba14aea833
    +--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
    | ID                                   | Name                                         | Tenant ID                        | Status | Task State | Power State | Networks                                       |
    +--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
    | 1ed8f651-de31-4208-81c5-817363818596 | amphora-1c3a4598-5489-48ea-8b9c-60c821269e4c | 37fd6e4feac14741b6e75aba14aea833 | ACTIVE | -          | Running     | private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.12 |
    +--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
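
The comparison in step 4 can be sketched in Python as follows. The sketch assumes that the value of the Networks column has been copied from two listings, one taken before and one after disabling the port. The function names management_ip and failed_over are hypothetical.

```python
def management_ip(networks_field, network_name="OCTAVIA-MGMT-NET"):
    """Return the address on network_name from a Networks column value, or None."""
    for entry in networks_field.split(";"):
        name, _, address = entry.strip().partition("=")
        if name == network_name:
            return address
    return None


def failed_over(before, after):
    """Return True if the management address changed between two listings."""
    old_ip, new_ip = management_ip(before), management_ip(after)
    return old_ip is not None and new_ip is not None and old_ip != new_ip


before = "private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.11"
after = "private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.12"
print(failed_over(before, after))
```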
Warning

Do not issue too many failovers at once. In a big installation you might be tempted to initiate several failovers in parallel, for instance to speed up an update of amphora images. This puts a strain on the nova service, and depending on the size of your installation, you might need to throttle the failover rate.
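
One simple way to throttle failovers is to process the ports in small batches with a pause between batches. The following Python sketch shows the idea. The disable_port callable is supplied by the caller and is an assumption of this example, as are the default batch size and pause length, which need to be tuned for your installation.

```python
import time


def batches(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fail_over_in_batches(port_ids, disable_port, batch_size=2, pause_seconds=300):
    """Call disable_port for every port ID, pausing between batches."""
    groups = list(batches(port_ids, batch_size))
    for index, group in enumerate(groups):
        for port_id in group:
            disable_port(port_id)
        # No pause is needed after the last batch.
        if index < len(groups) - 1:
            time.sleep(pause_seconds)
```

For a dry run, pass print as disable_port and set pause_seconds to 0.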

10.4.9.4 Load Balancer: Octavia Administration

10.4.9.4.1 Removing load balancers

The following procedures demonstrate how to delete a load balancer that is in the ERROR, PENDING_CREATE, or PENDING_DELETE state.

Procedure 10.3: Manually deleting load balancers created with neutron lbaasv2 (in an upgrade/migration scenario)
  1. Query the Neutron service for the loadbalancer ID:

    tux > neutron lbaas-loadbalancer-list
    neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
    +--------------------------------------+---------+----------------------------------+--------------+---------------------+----------+
    | id                                   | name    | tenant_id                        | vip_address  | provisioning_status | provider |
    +--------------------------------------+---------+----------------------------------+--------------+---------------------+----------+
    | 7be4e4ab-e9c6-4a57-b767-da9af5ba7405 | test-lb | d62a1510b0f54b5693566fb8afeb5e33 | 192.168.1.10 | ERROR               | haproxy  |
    +--------------------------------------+---------+----------------------------------+--------------+---------------------+----------+
  2. Connect to the neutron database:

    Important

    The default database name depends on the life cycle manager. Ardana uses ovs_neutron while Crowbar uses neutron.

    Ardana:

    mysql> use ovs_neutron

    Crowbar:

    mysql> use neutron
  3. Get the pools and healthmonitors associated with the loadbalancer:

    mysql> select id, healthmonitor_id, loadbalancer_id from lbaas_pools where loadbalancer_id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
    +--------------------------------------+--------------------------------------+--------------------------------------+
    | id                                   | healthmonitor_id                     | loadbalancer_id                      |
    +--------------------------------------+--------------------------------------+--------------------------------------+
    | 26c0384b-fc76-4943-83e5-9de40dd1c78c | 323a3c4b-8083-41e1-b1d9-04e1fef1a331 | 7be4e4ab-e9c6-4a57-b767-da9af5ba7405 |
    +--------------------------------------+--------------------------------------+--------------------------------------+
  4. Get the members associated with the pool:

    mysql> select id, pool_id from lbaas_members where pool_id = '26c0384b-fc76-4943-83e5-9de40dd1c78c';
    +--------------------------------------+--------------------------------------+
    | id                                   | pool_id                              |
    +--------------------------------------+--------------------------------------+
    | 6730f6c1-634c-4371-9df5-1a880662acc9 | 26c0384b-fc76-4943-83e5-9de40dd1c78c |
    | 06f0cfc9-379a-4e3d-ab31-cdba1580afc2 | 26c0384b-fc76-4943-83e5-9de40dd1c78c |
    +--------------------------------------+--------------------------------------+
  5. Delete the pool members:

    mysql> delete from lbaas_members where id = '6730f6c1-634c-4371-9df5-1a880662acc9';
    mysql> delete from lbaas_members where id = '06f0cfc9-379a-4e3d-ab31-cdba1580afc2';
  6. Find and delete the listener associated with the loadbalancer:

    mysql> select id, loadbalancer_id, default_pool_id from lbaas_listeners where loadbalancer_id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
    +--------------------------------------+--------------------------------------+--------------------------------------+
    | id                                   | loadbalancer_id                      | default_pool_id                      |
    +--------------------------------------+--------------------------------------+--------------------------------------+
    | 3283f589-8464-43b3-96e0-399377642e0a | 7be4e4ab-e9c6-4a57-b767-da9af5ba7405 | 26c0384b-fc76-4943-83e5-9de40dd1c78c |
    +--------------------------------------+--------------------------------------+--------------------------------------+
    mysql> delete from lbaas_listeners where id = '3283f589-8464-43b3-96e0-399377642e0a';
  7. Delete the pool associated with the loadbalancer:

    mysql> delete from lbaas_pools where id = '26c0384b-fc76-4943-83e5-9de40dd1c78c';
  8. Delete the healthmonitor associated with the pool:

    mysql> delete from lbaas_healthmonitors where id = '323a3c4b-8083-41e1-b1d9-04e1fef1a331';
  9. Delete the loadbalancer:

    mysql> delete from lbaas_loadbalancer_statistics where loadbalancer_id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
    mysql> delete from lbaas_loadbalancers where id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
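
The delete statements in steps 5 through 9 can be generated for review before they are run. The following is a minimal sketch, not part of the product. The function name lbaas_cleanup_sql is hypothetical, and the function only prints statements without connecting to the database. It assumes the IDs have already been collected as shown in the preceding steps.

```shell
# Hypothetical helper: print the SQL statements from steps 5-9 for review.
# Nothing is executed against the database.
# Usage: lbaas_cleanup_sql LB_ID POOL_ID HEALTHMONITOR_ID LISTENER_ID [MEMBER_ID...]
lbaas_cleanup_sql() {
  if [ "$#" -lt 4 ]; then
    echo "usage: lbaas_cleanup_sql LB_ID POOL_ID HEALTHMONITOR_ID LISTENER_ID [MEMBER_ID...]" >&2
    return 1
  fi
  lbsql_lb=$1
  lbsql_pool=$2
  lbsql_hm=$3
  lbsql_listener=$4
  shift 4

  # Step 5: pool members (one statement per remaining argument)
  for lbsql_member in "$@"; do
    printf "delete from lbaas_members where id = '%s';\n" "$lbsql_member"
  done
  # Step 6: listener
  printf "delete from lbaas_listeners where id = '%s';\n" "$lbsql_listener"
  # Step 7: pool
  printf "delete from lbaas_pools where id = '%s';\n" "$lbsql_pool"
  # Step 8: health monitor
  printf "delete from lbaas_healthmonitors where id = '%s';\n" "$lbsql_hm"
  # Step 9: statistics row, then the load balancer itself
  printf "delete from lbaas_loadbalancer_statistics where loadbalancer_id = '%s';\n" "$lbsql_lb"
  printf "delete from lbaas_loadbalancers where id = '%s';\n" "$lbsql_lb"
}
```

Called with the load balancer, pool, health monitor, listener, and two member IDs from the example above, the function prints seven statements in the same order as the procedure. They can be checked and then pasted at the mysql prompt.
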
Procedure 10.4: Manually Deleting Load Balancers Created With Octavia
  1. Query the Octavia service for the loadbalancer ID:

    tux > openstack loadbalancer list --column id --column name --column provisioning_status
    +--------------------------------------+---------+---------------------+
    | id                                   | name    | provisioning_status |
    +--------------------------------------+---------+---------------------+
    | d8ac085d-e077-4af2-b47a-bdec0c162928 | test-lb | ERROR               |
    +--------------------------------------+---------+---------------------+
  2. Query the Octavia service for the amphora IDs (this example uses an ACTIVE/STANDBY topology with one spare amphora):

    tux > openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
    | id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip       |
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
    | 6dc66d41-e4b6-4c33-945d-563f8b26e675 | d8ac085d-e077-4af2-b47a-bdec0c162928 | ALLOCATED | BACKUP | 172.30.1.7    | 192.168.1.8 |
    | 1b195602-3b14-4352-b355-5c4a70e200cf | d8ac085d-e077-4af2-b47a-bdec0c162928 | ALLOCATED | MASTER | 172.30.1.6    | 192.168.1.8 |
    | b2ee14df-8ac6-4bb0-a8d3-3f378dbc2509 | None                                 | READY     | None   | 172.30.1.20   | None        |
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
  3. Query the Octavia service for the loadbalancer pools:

    tux > openstack loadbalancer pool list
    +--------------------------------------+-----------+----------------------------------+---------------------+----------+--------------+----------------+
    | id                                   | name      | project_id                       | provisioning_status | protocol | lb_algorithm | admin_state_up |
    +--------------------------------------+-----------+----------------------------------+---------------------+----------+--------------+----------------+
    | 39c4c791-6e66-4dd5-9b80-14ea11152bb5 | test-pool | 86fba765e67f430b83437f2f25225b65 | ACTIVE              | TCP      | ROUND_ROBIN  | True           |
    +--------------------------------------+-----------+----------------------------------+---------------------+----------+--------------+----------------+
  4. Connect to the octavia database:

    mysql> use octavia
  5. Delete any listeners, pools, health monitors, and members from the load balancer:

    mysql> delete from listener where load_balancer_id = 'd8ac085d-e077-4af2-b47a-bdec0c162928';
    mysql> delete from health_monitor where pool_id = '39c4c791-6e66-4dd5-9b80-14ea11152bb5';
    mysql> delete from member where pool_id = '39c4c791-6e66-4dd5-9b80-14ea11152bb5';
    mysql> delete from pool where load_balancer_id = 'd8ac085d-e077-4af2-b47a-bdec0c162928';
  6. For each amphora allocated to the load balancer (not the spare), delete its health record and mark the amphora as DELETED in the database:

    mysql> delete from amphora_health where amphora_id = '6dc66d41-e4b6-4c33-945d-563f8b26e675';
    mysql> update amphora set status = 'DELETED' where id = '6dc66d41-e4b6-4c33-945d-563f8b26e675';
    mysql> delete from amphora_health where amphora_id = '1b195602-3b14-4352-b355-5c4a70e200cf';
    mysql> update amphora set status = 'DELETED' where id = '1b195602-3b14-4352-b355-5c4a70e200cf';
  7. Mark the load balancer as DELETED:

    mysql> update load_balancer set provisioning_status = 'DELETED' where id = 'd8ac085d-e077-4af2-b47a-bdec0c162928';
  8. The following script automates the above steps:

    #!/bin/bash
    # Removes the listeners, pools, health monitors, and members of a load
    # balancer from the octavia database, then marks its amphorae and the
    # load balancer itself as DELETED.
    
    if (( $# != 1 )); then
        echo "Please specify a loadbalancer ID" >&2
        exit 1
    fi
    
    LB_ID=$1
    
    set -u -e -x
    
    # IDs of the amphorae allocated to this load balancer
    readarray -t AMPHORAE < <(openstack loadbalancer amphora list \
        --format value \
        --column id \
        --column loadbalancer_id \
        | grep "${LB_ID}" \
        | cut -d ' ' -f 1)
    
    # IDs of the pools that belong to this load balancer
    readarray -t POOLS < <(openstack loadbalancer show "${LB_ID}" \
        --format value \
        --column pools)
    
    mysql octavia --execute "delete from listener where load_balancer_id = '${LB_ID}';"
    for p in "${POOLS[@]}"; do
        mysql octavia --execute "delete from health_monitor where pool_id = '${p}';"
        mysql octavia --execute "delete from member where pool_id = '${p}';"
    done
    mysql octavia --execute "delete from pool where load_balancer_id = '${LB_ID}';"
    for a in "${AMPHORAE[@]}"; do
        mysql octavia --execute "delete from amphora_health where amphora_id = '${a}';"
        mysql octavia --execute "update amphora set status = 'DELETED' where id = '${a}';"
    done
    mysql octavia --execute "update load_balancer set provisioning_status = 'DELETED' where id = '${LB_ID}';"
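
Because the script changes database rows directly, it can help to reject a malformed ID before any statement runs. The following is a minimal sketch of such a check. It assumes that load balancer IDs use the lowercase, hyphenated UUID format shown in the examples above. The function name is_uuid is hypothetical and is not part of the script above.

```shell
# Hypothetical helper: succeed only if $1 looks like a lowercase UUID
# (8-4-4-4-12 hexadecimal digits), the format used for the IDs above.
is_uuid() {
  # A UUID in this format is exactly 36 characters long.
  [ "${#1}" -eq 36 ] || return 1
  # -x: the whole line must match, -q: no output, -E: extended regex
  printf '%s\n' "$1" | grep -Eqx '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
}
```

A script could call the function directly after reading its argument, for example with is_uuid "$LB_ID" || exit 1, so that a typo or a stray quote never reaches the SQL statements.
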

10.4.9.5 For More Information

For more information on the OpenStackClient and Octavia terminology, see the OpenStackClient guide.

10.4.10 Role-based Access Control in neutron

This topic explains how to achieve more granular access control for your neutron networks.

Previously in SUSE OpenStack Cloud, a network object was either private to a project or could be used by all projects. If the network's shared attribute was True, the network could be used by every project in the cloud. If it was False, only members of the owning project could use it. There was no way to share the network with only a subset of the projects.

neutron Role-Based Access Control (RBAC) solves this problem for networks. The network owner can now create RBAC policies that give network access to target projects. Members of a targeted project can use the network named in the RBAC policy as if the network were owned by their own project. Constraints are described in Section 10.4.10.10, “Limitations”.

With RBAC you are able to let another tenant use a network that you created, but as the owner of the network, you need to create the subnet and the router for the network.

10.4.10.1 Creating a Network

ardana > openstack network create demo-net
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2018-07-25T17:43:59Z                 |
| description               |                                      |
| dns_domain                |                                      |
| id                        | 9c801954-ec7f-4a65-82f8-e313120aabc4 |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | False                                |
| is_vlan_transparent       | None                                 |
| mtu                       | 1450                                 |
| name                      | demo-net                             |
| port_security_enabled     | False                                |
| project_id                | cb67c79e25a84e328326d186bf703e1b     |
| provider:network_type     | vxlan                                |
| provider:physical_network | None                                 |
| provider:segmentation_id  | 1009                                 |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | False                                |
| status                    | ACTIVE                               |
| subnets                   |                                      |
| tags                      |                                      |
| updated_at                | 2018-07-25T17:43:59Z                 |
+---------------------------+--------------------------------------+

10.4.10.2 Creating an RBAC Policy

Here we will create an RBAC policy with which a member of the project called 'demo' shares the network with members of the project 'demo2'.

To create the RBAC policy, run:

ardana > openstack network rbac create --target-project DEMO2-PROJECT-ID --type network --action access_as_shared demo-net

Here is an example where the DEMO2-PROJECT-ID is 5a582af8b44b422fafcd4545bd2b7eb5:

ardana > openstack network rbac create --target-project 5a582af8b44b422fafcd4545bd2b7eb5 \
  --type network --action access_as_shared demo-net
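
Sharing a network with a subset of projects means creating one RBAC policy per target project. The following is a minimal sketch of a wrapper that repeats the command above for each project ID. The function name share_network_with_projects is hypothetical. It assumes the openstack client is on the PATH and that the credentials of the network owner have been sourced.

```shell
# Hypothetical wrapper: create one access_as_shared RBAC policy per target project.
# Usage: share_network_with_projects NETWORK PROJECT_ID [PROJECT_ID...]
share_network_with_projects() {
  if [ "$#" -lt 2 ]; then
    echo "usage: share_network_with_projects NETWORK PROJECT_ID [PROJECT_ID...]" >&2
    return 1
  fi
  rbac_network=$1
  shift
  for rbac_project in "$@"; do
    # Same command as shown above, with the project ID filled in.
    # Stop at the first failure so that errors are not silently skipped.
    openstack network rbac create \
      --target-project "$rbac_project" \
      --type network \
      --action access_as_shared \
      "$rbac_network" || return 1
  done
}
```

For example, share_network_with_projects demo-net 5a582af8b44b422fafcd4545bd2b7eb5 creates the same policy as the command above. Additional project IDs can be appended as further arguments.
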

10.4.10.3 Listing RBACs

To list all the RBAC rules/policies, execute:

ardana > openstack network rbac list
+--------------------------------------+-------------+--------------------------------------+
| ID                                   | Object Type | Object ID                            |
+--------------------------------------+-------------+--------------------------------------+
| 0fdec7f0-9b94-42b4-a4cd-b291d04282c1 | network     | 7cd94877-4276-488d-b682-7328fc85d721 |
+--------------------------------------+-------------+--------------------------------------+
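
The list identifies each policy by the ID of the object it applies to, not by the network name. To find the policies for one network, the list can be filtered by network ID. The following is a minimal sketch. The function name rbac_policies_for_network is hypothetical, and it assumes that the column headers of your client version match those in the output above (ID and Object ID).

```shell
# Hypothetical helper: print the IDs of all RBAC policies whose object is
# the given network ID (pass the network ID, not the network name).
# Usage: rbac_policies_for_network NETWORK_ID
rbac_policies_for_network() {
  # With --format value, each row is printed as "POLICY_ID OBJECT_ID".
  openstack network rbac list --format value --column ID --column "Object ID" \
    | awk -v net="$1" '$2 == net { print $1 }'
}
```

The resulting policy IDs can be passed to openstack network rbac show or openstack network rbac delete.
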

10.4.10.4 Listing the Attributes of an RBAC

To see the attributes of a specific RBAC policy, run:

ardana > openstack network rbac show POLICY-ID

For example:

ardana > openstack network rbac show 0fd89dcb-9809-4a5e-adc1-39dd676cb386

Here is the output:

+---------------+--------------------------------------+
| Field         | Value                                |
+---------------+--------------------------------------+
| action        | access_as_shared                     |
| id            | 0fd89dcb-9809-4a5e-adc1-39dd676cb386 |
| object_id     | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b |
| object_type   | network                              |
| target_tenant | 5a582af8b44b422fafcd4545bd2b7eb5     |
| tenant_id     | 75eb5efae5764682bca2fede6f4d8c6f     |
+---------------+--------------------------------------+

10.4.10.5 Deleting an RBAC Policy

To delete an RBAC policy, run openstack network rbac delete and pass it the policy ID:

ardana > openstack network rbac delete POLICY-ID

For example:

ardana > openstack network rbac delete 0fd89dcb-9809-4a5e-adc1-39dd676cb386

Here is the output:

Deleted rbac_policy: 0fd89dcb-9809-4a5e-adc1-39dd676cb386

10.4.10.6 Sharing a Network with All Tenants

Either the administrator or the network owner can make a network shareable by all tenants. The following procedure first shows the administrator making the tenant network demo-shareall-net accessible to all tenants in the cloud, and then the network owner doing the same with an RBAC policy.

To share a network with all tenants:

  1. Get a list of all projects:

    ardana > source ~/service.osrc
    ardana > openstack project list

    This produces the following list:

    +----------------------------------+------------------+
    | ID                               | Name             |
    +----------------------------------+------------------+
    | 1be57778b61645a7a1c07ca0ac488f9e | demo             |
    | 5346676226274cd2b3e3862c2d5ceadd | admin            |
    | 749a557b2b9c482ca047e8f4abf348cd | swift-monitor    |
    | 8284a83df4df429fb04996c59f9a314b | swift-dispersion |
    | c7a74026ed8d4345a48a3860048dcb39 | demo-sharee      |
    | e771266d937440828372090c4f99a995 | glance-swift     |
    | f43fb69f107b4b109d22431766b85f20 | services         |
    +----------------------------------+------------------+
  2. Get a list of networks:

    ardana > openstack network list

    This produces the following list:

    +--------------------------------------+-------------------+----------------------------------------------------+
    | id                                   | name              | subnets                                            |
    +--------------------------------------+-------------------+----------------------------------------------------+
    | f50f9a63-c048-444d-939d-370cb0af1387 | ext-net           | ef3873db-fc7a-4085-8454-5566fb5578ea 172.31.0.0/16 |
    | 9fb676f5-137e-4646-ac6e-db675a885fd3 | demo-net          | 18fb0b77-fc8b-4f8d-9172-ee47869f92cc 10.0.1.0/24   |
    | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e | demo-shareall-net | 2bbc85a9-3ffe-464c-944b-2476c7804877 10.0.250.0/24 |
    | 73f946ee-bd2b-42e9-87e4-87f19edd0682 | demo-share-subset | c088b0ef-f541-42a7-b4b9-6ef3c9921e44 10.0.2.0/24   |
    +--------------------------------------+-------------------+----------------------------------------------------+
  3. Set the network you want to share to a shared value of True:

    ardana > openstack network set --share 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e

    You should see the following output:

    Updated network: 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e
  4. Check the attributes of that network by running the following command using the ID of the network in question:

    ardana > openstack network show 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e

    The output will look like this:

    +---------------------------+--------------------------------------+
    | Field                     | Value                                |
    +---------------------------+--------------------------------------+
    | admin_state_up            | UP                                   |
    | availability_zone_hints   |                                      |
    | availability_zones        |                                      |
    | created_at                | 2018-07-25T17:43:59Z                 |
    | description               |                                      |
    | dns_domain                |                                      |
    | id                        | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e |
    | ipv4_address_scope        | None                                 |
    | ipv6_address_scope        | None                                 |
    | is_default                | None                                 |
    | is_vlan_transparent       | None                                 |
    | mtu                       | 1450                                 |
    | name                      | demo-shareall-net                    |
    | port_security_enabled     | False                                |
    | project_id                | cb67c79e25a84e328326d186bf703e1b     |
    | provider:network_type     | vxlan                                |
    | provider:physical_network | None                                 |
    | provider:segmentation_id  | 1009                                 |
    | qos_policy_id             | None                                 |
    | revision_number           | 2                                    |
    | router:external           | Internal                             |
    | segments                  | None                                 |
    | shared                    | True                                 |
    | status                    | ACTIVE                               |
    | subnets                   |                                      |
    | tags                      |                                      |
    | updated_at                | 2018-07-25T17:43:59Z                 |
    +---------------------------+--------------------------------------+
  5. As the owner of the demo-shareall-net network, view the RBAC attributes for demo-shareall-net (id=8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e) by first getting an RBAC list:

    ardana > echo $OS_USERNAME ; echo $OS_PROJECT_NAME
    demo
    demo
    ardana > openstack network rbac list

    This produces the list:

    +--------------------------------------+--------------------------------------+
    | id                                   | object_id                            |
    +--------------------------------------+--------------------------------------+
    | ...                                                                         |
    | 3e078293-f55d-461c-9a0b-67b5dae321e8 | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e |
    +--------------------------------------+--------------------------------------+
  6. View the RBAC information:

    ardana > openstack network rbac show 3e078293-f55d-461c-9a0b-67b5dae321e8
    
    +---------------+--------------------------------------+
    | Field         | Value                                |
    +---------------+--------------------------------------+
    | action        | access_as_shared                     |
    | id            | 3e078293-f55d-461c-9a0b-67b5dae321e8 |
    | object_id     | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e |
    | object_type   | network                              |
    | target_tenant | *                                    |
    | tenant_id     | 1be57778b61645a7a1c07ca0ac488f9e     |
    +---------------+--------------------------------------+
  7. With network RBAC, the owner of the network can also make the network shareable by all tenants. First create the network:

    ardana > echo $OS_PROJECT_NAME ; echo $OS_USERNAME
    demo
    demo
    ardana > openstack network create test-net

    The network is created:

    +---------------------------+--------------------------------------+
    | Field                     | Value                                |
    +---------------------------+--------------------------------------+
    | admin_state_up            | UP                                   |
    | availability_zone_hints   |                                      |
    | availability_zones        |                                      |
    | created_at                | 2018-07-25T18:04:25Z                 |
    | description               |                                      |
    | dns_domain                |                                      |
    | id                        | a4bd7c3a-818f-4431-8cdb-fedf7ff40f73 |
    | ipv4_address_scope        | None                                 |
    | ipv6_address_scope        | None                                 |
    | is_default                | False                                |
    | is_vlan_transparent       | None                                 |
    | mtu                       | 1450                                 |
    | name                      | test-net                             |
    | port_security_enabled     | False                                |
    | project_id                | cb67c79e25a84e328326d186bf703e1b     |
    | provider:network_type     | vxlan                                |
    | provider:physical_network | None                                 |
    | provider:segmentation_id  | 1073                                 |
    | qos_policy_id             | None                                 |
    | revision_number           | 2                                    |
    | router:external           | Internal                             |
    | segments                  | None                                 |
    | shared                    | False                                |
    | status                    | ACTIVE                               |
    | subnets                   |                                      |
    | tags                      |                                      |
    | updated_at                | 2018-07-25T18:04:25Z                 |
    +---------------------------+--------------------------------------+
  8. Create the RBAC policy. The asterisk must be enclosed in single quotes to prevent the shell from expanding it to the names of all files in the current directory.

    ardana > openstack network rbac create --type network \
      --action access_as_shared --target-project '*' test-net

    Here are the resulting RBAC attributes:

    +---------------+--------------------------------------+
    | Field         | Value                                |
    +---------------+--------------------------------------+
    | action        | access_as_shared                     |
    | id            | 0b797cc6-debc-48a1-bf9d-d294b077d0d9 |
    | object_id     | a4bd7c3a-818f-4431-8cdb-fedf7ff40f73 |
    | object_type   | network                              |
    | target_tenant | *                                    |
    | tenant_id     | 1be57778b61645a7a1c07ca0ac488f9e     |
    +---------------+--------------------------------------+
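
The effect of the single quotes around the asterisk can be observed without contacting the cloud. The following is a minimal sketch that uses echo in place of the openstack client. The function name show_glob_expansion and the file names a.txt and b.txt are made up for the demonstration.

```shell
# Show how the shell treats * with and without quotes.
# The body runs in a subshell, inside a temporary directory that
# contains two empty files, so the caller's directory is unchanged.
show_glob_expansion() (
  demo_dir=$(mktemp -d) || exit 1
  touch "$demo_dir/a.txt" "$demo_dir/b.txt"
  cd "$demo_dir" || exit 1
  # Unquoted: the shell replaces * with the matching file names.
  echo "unquoted:" *
  # Quoted: the command receives the literal character *.
  echo "quoted:" '*'
  # Leave the directory before removing it.
  cd / && rm -r "$demo_dir"
)
```

Running show_glob_expansion prints "unquoted: a.txt b.txt" followed by "quoted: *". Without the quotes, openstack network rbac create would receive file names instead of the wildcard that stands for all projects.
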

10.4.10.7 Target Project (demo2) View of Networks and Subnets

Note that the owner of the network and subnet is not the tenant named demo2. Both the network and subnet are owned by the tenant demo. Members of demo2 cannot create subnets on the network. They also cannot modify or delete subnets owned by demo.

As the tenant demo2, you can get a list of neutron networks:

ardana > openstack network list
+--------------------------------------+-----------+--------------------------------------------------+
| id                                   | name      | subnets                                          |
+--------------------------------------+-----------+--------------------------------------------------+
| f60f3896-2854-4f20-b03f-584a0dcce7a6 | ext-net   | 50e39973-b2e3-466b-81c9-31f4d83d990b             |
| c3d55c21-d8c9-4ee5-944b-560b7e0ea33b | demo-net  | d9b765da-45eb-4543-be96-1b69a00a2556 10.0.1.0/24 |
   ...
+--------------------------------------+-----------+--------------------------------------------------+

And get a list of subnets:

ardana > openstack subnet list --network c3d55c21-d8c9-4ee5-944b-560b7e0ea33b
+--------------------------------------+-------------+--------------------------------------+---------------+
| ID                                   | Name        | Network                              | Subnet        |
+--------------------------------------+-------------+--------------------------------------+---------------+
| d9b765da-45eb-4543-be96-1b69a00a2556 | sb-demo-net | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b | 10.0.1.0/24   |
+--------------------------------------+-------------+--------------------------------------+---------------+

To show details of the subnet:

ardana > openstack subnet show d9b765da-45eb-4543-be96-1b69a00a2556
+-------------------+--------------------------------------------+
| Field             | Value                                      |
+-------------------+--------------------------------------------+
| allocation_pools  | {"start": "10.0.1.2", "end": "10.0.1.254"} |
| cidr              | 10.0.1.0/24                                |
| dns_nameservers   |                                            |
| enable_dhcp       | True                                       |
| gateway_ip        | 10.0.1.1                                   |
| host_routes       |                                            |
| id                | d9b765da-45eb-4543-be96-1b69a00a2556       |
| ip_version        | 4                                          |
| ipv6_address_mode |                                            |
| ipv6_ra_mode      |                                            |
| name              | sb-demo-net                                |
| network_id        | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b       |
| subnetpool_id     |                                            |
| tenant_id         | 75eb5efae5764682bca2fede6f4d8c6f           |
+-------------------+--------------------------------------------+

10.4.10.8 Target Project: Creating a Port Using demo-net

The owner of the port is demo2. Members of the network owner project (demo) will not see this port.

Run the following command, replacing PORT-NAME with a name for the new port:

ardana > openstack port create --network c3d55c21-d8c9-4ee5-944b-560b7e0ea33b PORT-NAME

This creates a new port:

+-----------------------+-----------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                               |
+-----------------------+-----------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                |
| allowed_address_pairs |                                                                                                     |
| binding:vnic_type     | normal                                                                                              |
| device_id             |                                                                                                     |
| device_owner          |                                                                                                     |
| dns_assignment        | {"hostname": "host-10-0-1-10", "ip_address": "10.0.1.10", "fqdn": "host-10-0-1-10.openstacklocal."} |
| dns_name              |                                                                                                     |
| fixed_ips             | {"subnet_id": "d9b765da-45eb-4543-be96-1b69a00a2556", "ip_address": "10.0.1.10"}                    |
| id                    | 03ef2dce-20dc-47e5-9160-942320b4e503                                                                |
| mac_address           | fa:16:3e:27:8d:ca                                                                                   |
| name                  |                                                                                                     |
| network_id            | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b                                                                |
| security_groups       | 275802d0-33cb-4796-9e57-03d8ddd29b94                                                                |
| status                | DOWN                                                                                                |
| tenant_id             | 5a582af8b44b422fafcd4545bd2b7eb5                                                                    |
+-----------------------+-----------------------------------------------------------------------------------------------------+

10.4.10.9 Target Project: Booting a VM Using demo-net

Here the tenant demo2 boots a VM that uses the demo-net shared network:

ardana > openstack server create --flavor 1 --image $OS_IMAGE --nic net-id=c3d55c21-d8c9-4ee5-944b-560b7e0ea33b demo2-vm-using-demo-net-nic
+--------------------------------------+------------------------------------------------+
| Property                             | Value                                          |
+--------------------------------------+------------------------------------------------+
| OS-EXT-AZ:availability_zone          |                                                |
| OS-EXT-STS:power_state               | 0                                              |
| OS-EXT-STS:task_state                | scheduling                                     |
| OS-EXT-STS:vm_state                  | building                                       |
| OS-SRV-USG:launched_at               | -                                              |
| OS-SRV-USG:terminated_at             | -                                              |
| accessIPv4                           |                                                |
| accessIPv6                           |                                                |
| adminPass                            | sS9uSv9PT79F                                   |
| config_drive                         |                                                |
| created                              | 2016-01-04T19:23:24Z                           |
| flavor                               | m1.tiny (1)                                    |
| hostId                               |                                                |
| id                                   | 3a4dc44a-027b-45e9-acf8-054a7c2dca2a           |
| image                                | cirros-0.3.3-x86_64 (6ae23432-8636-4e...1efc5) |
| key_name                             | -                                              |
| metadata                             | {}                                             |
| name                                 | demo2-vm-using-demo-net-nic                    |
| os-extended-volumes:volumes_attached | []                                             |
| progress                             | 0                                              |
| security_groups                      | default                                        |
| status                               | BUILD                                          |
| tenant_id                            | 5a582af8b44b422fafcd4545bd2b7eb5               |
| updated                              | 2016-01-04T19:23:24Z                           |
| user_id                              | a0e6427b036344fdb47162987cb0cee5               |
+--------------------------------------+------------------------------------------------+
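
The server is created asynchronously, so the output above still shows the status BUILD. The following is a minimal sketch of a polling loop that waits for the status to become ACTIVE. The function name wait_for_server_active and the defaults of 30 attempts with a 10-second delay are assumptions chosen for illustration, not values from the product.

```shell
# Hypothetical helper: poll until SERVER reports the status ACTIVE.
# Returns 0 on ACTIVE, 1 on ERROR or when the attempts are used up.
# Usage: wait_for_server_active SERVER [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_server_active() {
  wfs_server=$1
  wfs_attempts=${2:-30}
  wfs_delay=${3:-10}
  wfs_i=0
  while [ "$wfs_i" -lt "$wfs_attempts" ]; do
    # Print only the value of the status field.
    wfs_status=$(openstack server show "$wfs_server" --format value --column status)
    case $wfs_status in
      ACTIVE)
        return 0
        ;;
      ERROR)
        echo "server $wfs_server is in ERROR state" >&2
        return 1
        ;;
    esac
    wfs_i=$((wfs_i + 1))
    sleep "$wfs_delay"
  done
  echo "timed out waiting for server $wfs_server" >&2
  return 1
}
```

For example, wait_for_server_active demo2-vm-using-demo-net-nic returns once the VM is running, after which the server and port listings show the assigned address.
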

Run openstack server list:

ardana > openstack server list

See the VM running:

+-------------------+-----------------------------+--------+------------+-------------+--------------------+
| ID                | Name                        | Status | Task State | Power State | Networks           |
+-------------------+-----------------------------+--------+------------+-------------+--------------------+
| 3a4dc...a7c2dca2a | demo2-vm-using-demo-net-nic | ACTIVE | -          | Running     | demo-net=10.0.1.11 |
+-------------------+-----------------------------+--------+------------+-------------+--------------------+

Run openstack port list to find the port attached to the VM:

ardana > openstack port list --device-id 3a4dc44a-027b-45e9-acf8-054a7c2dca2a

The output shows the port and its fixed IP address on the demo-net subnet:

+---------------------+------+-------------------+-------------------------------------------------------------------+
| id                  | name | mac_address       | fixed_ips                                                         |
+---------------------+------+-------------------+-------------------------------------------------------------------+
| 7d14ef8b-9...80348f |      | fa:16:3e:75:32:8e | {"subnet_id": "d9b765da-45...00a2556", "ip_address": "10.0.1.11"} |
+---------------------+------+-------------------+-------------------------------------------------------------------+

Run openstack port show:

ardana > openstack port show 7d14ef8b-9d48-4310-8c02-00c74d80348f
+-----------------------+-----------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                               |
+-----------------------+-----------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                |
| allowed_address_pairs |                                                                                                     |
| binding:vnic_type     | normal                                                                                              |
| device_id             | 3a4dc44a-027b-45e9-acf8-054a7c2dca2a                                                                |
| device_owner          | compute:None                                                                                        |
| dns_assignment        | {"hostname": "host-10-0-1-11", "ip_address": "10.0.1.11", "fqdn": "host-10-0-1-11.openstacklocal."} |
| dns_name              |                                                                                                     |
| extra_dhcp_opts       |                                                                                                     |
| fixed_ips             | {"subnet_id": "d9b765da-45eb-4543-be96-1b69a00a2556", "ip_address": "10.0.1.11"}                    |
| id                    | 7d14ef8b-9d48-4310-8c02-00c74d80348f                                                                |
| mac_address           | fa:16:3e:75:32:8e                                                                                   |
| name                  |                                                                                                     |
| network_id            | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b                                                                |
| security_groups       | 275802d0-33cb-4796-9e57-03d8ddd29b94                                                                |
| status                | ACTIVE                                                                                              |
| tenant_id             | 5a582af8b44b422fafcd4545bd2b7eb5                                                                    |
+-----------------------+-----------------------------------------------------------------------------------------------------+

10.4.10.10 Limitations

Note the following limitations of RBAC in neutron.

  • neutron network is the only supported RBAC neutron object type.

  • The "access_as_external" action is not supported – even though it is listed as a valid action by python-neutronclient.

  • The neutron-api server will not accept action value of 'access_as_external'. The access_as_external definition is not found in the specs.

  • The target project users cannot create, modify, or delete subnets on networks that have RBAC policies.

  • The subnet of a network that has an RBAC policy cannot be added as an interface of a target tenant's router. For example, the command openstack router add subnet tgt-tenant-router <sb-demo-net uuid> will error out.

  • The security group rules on the network owner do not apply to other projects that can use the network.

  • A user in the target project can boot VMs with a VNIC on the shared network. The user of the target project can assign a floating IP (FIP) to the VM. The target project must have security group (SG) rules that allow SSH and/or ICMP for VM connectivity.

  • neutron RBAC creation and management are currently not supported in horizon. For now, the neutron CLI has to be used to manage RBAC rules.

  • An RBAC rule tells neutron whether a tenant can access a network (Allow). Currently there is no DENY action.

  • Port creation on a shared network fails if --fixed-ip is specified in the openstack port create command.

10.4.11 Configuring Maximum Transmission Units in neutron

This topic explains how you can configure MTUs, what to look out for, and the results and implications of changing the default MTU settings. It is important to note that every network within a network group will have the same MTU.

Warning

An MTU change will not affect existing networks that have had VMs created on them. It will only take effect on new networks created after the reconfiguration process.

10.4.11.1 Overview

A Maximum Transmission Unit (MTU) is the maximum packet size (in bytes) that a network device can handle or is configured to handle. There are a number of places in your cloud where MTU configuration is relevant: the physical interfaces managed and configured by SUSE OpenStack Cloud, the virtual interfaces created by neutron and nova for neutron networking, and the interfaces inside the VMs.

SUSE OpenStack Cloud-managed physical interfaces

SUSE OpenStack Cloud-managed physical interfaces include the physical interfaces and the bonds, bridges, and VLANs created on top of them. The MTU for these interfaces is configured via the 'mtu' property of a network group. Because multiple network groups can be mapped to one physical interface, differing MTUs between the untagged and tagged VLANs on the same physical interface may have to be resolved. For instance, if an untagged VLAN, vlan101 (with an MTU of 1500), and a tagged VLAN, vlan201 (with an MTU of 9000), are both on one interface (eth0), then eth0 can handle 1500, but the VLAN interface created on top of eth0 (that is, vlan201@eth0) wants 9000. However, vlan201 cannot have a higher MTU than eth0, so vlan201 will be limited to 1500 when it is brought up, and fragmentation will result.

In general, a VLAN interface MTU must be lower than or equal to the base device MTU. If they are different, as in the case above, the MTU of eth0 can be overridden and raised to 9000, but in any case the discrepancy will have to be reconciled.
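
The rule above can be illustrated with a short sketch. It is only an illustration of the eth0/vlan201 example from the text; the function name effective_vlan_mtu is hypothetical and is not part of SUSE OpenStack Cloud.

```python
def effective_vlan_mtu(requested_mtu, base_device_mtu):
    """Return the MTU a VLAN interface can actually use on its base device."""
    # A VLAN interface cannot exceed the MTU of the device it sits on.
    return min(requested_mtu, base_device_mtu)


# vlan201 asks for 9000 while eth0 stays at 1500: vlan201 is capped.
print(effective_vlan_mtu(9000, 1500))  # 1500

# After raising eth0 to 9000 the discrepancy is reconciled.
print(effective_vlan_mtu(9000, 9000))  # 9000
```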

neutron/nova interfaces

neutron/nova interfaces include the virtual devices created by neutron and nova during the normal process of realizing a neutron network/router and booting a VM on it (qr-*, qg-*, tap-*, qvo-*, qvb-*, etc.). There is currently no support in neutron/nova for per-network MTUs in which every interface along the path for a particular neutron network has the correct MTU for that network. There is, however, support for globally changing the MTU of devices created by neutron/nova (see network_device_mtu below). This means that if you want to enable jumbo frames for any set of VMs, you will have to enable it for all your VMs. You cannot just enable them for a particular neutron network.

VM interfaces

VMs typically get their MTU via DHCP advertisement, which means that the dnsmasq processes spawned by the neutron-dhcp-agent actually advertise a particular MTU to the VMs. In SUSE OpenStack Cloud 9, the DHCP server advertises a 1400 MTU to all VMs via a forced setting in dnsmasq-neutron.conf. This is suboptimal for every network type (vxlan, flat, vlan, etc.) but it does prevent fragmentation of a VM's packets due to encapsulation.

For instance, if you set the new *-mtu configuration options to a default of 1500 and create a VXLAN network, it will be given an MTU of 1450 (with the remaining 50 bytes used by the VXLAN encapsulation header) and will advertise a 1450 MTU to any VM booted on that network. If you create a provider VLAN network, it will have an MTU of 1500 and will advertise 1500 to booted VMs on the network. It should be noted that this default starting point for MTU calculation and advertisement is also global, meaning you cannot have an MTU of 8950 on one VXLAN network and 1450 on another. However, you can have provider physical networks with different MTUs by using the physical_network_mtus config option, but nova still requires a global MTU option for the interfaces it creates, thus you cannot really take advantage of that configuration option.
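
The arithmetic in the paragraph above can be sketched as follows. The 50-byte VXLAN overhead and the zero overhead for VLAN networks come from the text; treating flat networks like VLAN networks is an assumption, and the names ENCAP_OVERHEAD and advertised_mtu are hypothetical.

```python
# Encapsulation overhead in bytes per network type.
# vxlan=50 and vlan=0 are taken from the text; flat=0 is an assumption.
ENCAP_OVERHEAD = {"vxlan": 50, "vlan": 0, "flat": 0}


def advertised_mtu(global_physnet_mtu, network_type):
    """MTU advertised to VMs: the global starting point minus encapsulation."""
    return global_physnet_mtu - ENCAP_OVERHEAD[network_type]


print(advertised_mtu(1500, "vxlan"))  # 1450
print(advertised_mtu(1500, "vlan"))   # 1500
print(advertised_mtu(9000, "vxlan"))  # 8950
```

Because the starting point is global, both VXLAN results above cannot coexist in one deployment; the sketch only shows what each setting would yield.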

10.4.11.2 Network settings in the input model

MTU can be set as an attribute of a network group in network_groups.yml. Note that this applies only to KVM. That setting means that every network in the network group will be assigned the specified MTU. The MTU value must be set individually for each network group. For example:

network-groups:
  - name: GUEST
    mtu: 9000
    ...

  - name: EXTERNAL-API
    mtu: 9000
    ...

  - name: EXTERNAL-VM
    mtu: 9000
    ...

10.4.11.3 Infrastructure support for jumbo frames

If you want to use jumbo frames, or frames with an MTU of 9000 or more, the physical switches and routers that make up the infrastructure of the SUSE OpenStack Cloud installation must be configured to support them. To realize the advantages, all devices in the same broadcast domain must have the same MTU.

If you want to configure jumbo frames on compute and controller nodes, then all switches joining the compute and controller nodes must have jumbo frames enabled. Similarly, the "infrastructure gateway" through which the external VM network flows, commonly known as the default route for the external VM VLAN, must also have the same MTU configured.

You can also consider anything in the same broadcast domain to be anything in the same VLAN or anything in the same IP subnet.

10.4.11.4 Enabling end-to-end jumbo frames for a VM

  1. Add an mtu attribute to all the network groups in your model. Note that adding the MTU for the network groups will only affect the configuration for physical network interfaces.

    To add the mtu attribute, find the YAML file that contains your network-groups entry. We will assume it is network_groups.yml, unless you have changed it. Whatever the file is named, it will be found in ~/openstack/my_cloud/definition/data/.

    To edit these files, begin by checking out the site branch on the Cloud Lifecycle Manager node. You may already be on that branch. If so, you will remain there.

    ardana > cd ~/openstack/ardana/ansible
    ardana > git checkout site

    Then begin editing the files. In network_groups.yml, add mtu: 9000.

    network-groups:
      - name: GUEST
        hostname-suffix: guest
        mtu: 9000
        tags:
          - neutron.networks.vxlan

    This sets the MTU of the physical interface managed by SUSE OpenStack Cloud 9 that has the GUEST network group tag assigned to it. The interface can be found in the interfaces_set.yml file under the interface-models section.

  2. Edit neutron.conf.j2 found in ~/openstack/my_cloud/config/neutron/ to set global_physnet_mtu to 9000 under [DEFAULT]:

    [DEFAULT]
    ...
    global_physnet_mtu = 9000

    This allows neutron to advertise the optimal MTU to instances (based on global_physnet_mtu minus the encapsulation size).

  3. Remove the dhcp-option-force=26,1400 line from ~/openstack/my_cloud/config/neutron/dnsmasq-neutron.conf.j2.

  4. OvS sets the MTU of br-int to the lowest MTU among the physical interfaces. If you are using jumbo frames on some of your networks, br-int on the controllers may be set to 1500 instead of 9000. Work around this condition by running:

    ovs-vsctl set int br-int mtu_request=9000
  5. Commit your changes:

    ardana > git add -A
    ardana > git commit -m "your commit message goes here in quotes"
  6. If SUSE OpenStack Cloud has not been deployed yet, do normal deployment and skip to Step 8.

  7. Assuming it has been deployed already, continue here:

    Run the configuration processor:

    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

    and ready the deployment:

    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

    Then run the network_interface-reconfigure.yml playbook, changing directories first:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts network_interface-reconfigure.yml

    Then run neutron-reconfigure.yml:

    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

    Then nova-reconfigure.yml:

    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml

    Note: adding/changing network-group mtu settings will likely require a network restart when network_interface-reconfigure.yml is run.

  8. Follow the normal process for creating a neutron network and booting a VM or two. In this example, if a VXLAN network is created and a VM is booted on it, the VM will have an MTU of 8950, with the remaining 50 bytes used by the VXLAN encapsulation header.

  9. Test and verify that the VM can send and receive jumbo frames without fragmentation. You can use ping. For example, to test an MTU of 9000 using VXLAN:

    ardana > ping -M do -s 8950 YOUR_VM_FLOATING_IP

    Substitute your actual floating IP address for YOUR_VM_FLOATING_IP.
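
When choosing a value for -s, keep in mind that ping's -s option sets only the ICMP payload size. The IPv4 header (20 bytes without options) and the ICMP header (8 bytes) are added on top. The following sketch, which assumes IPv4 without IP options and uses hypothetical function names, shows the resulting sizes. If the ping above reports that the message is too long, a smaller payload may be needed.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def packet_size(payload):
    """Size of the IP packet produced by 'ping -s <payload>'."""
    return payload + IPV4_HEADER + ICMP_HEADER


def max_ping_payload(mtu):
    """Largest -s value that still fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(packet_size(8950))       # 8978
print(max_ping_payload(9000))  # 8972
print(max_ping_payload(8950))  # 8922
```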

10.4.11.5 Enabling Optimal MTU Advertisement Feature

To enable the optimal MTU feature, follow these steps:

  1. Edit ~/openstack/my_cloud/config/neutron/neutron.conf.j2 to remove the advertise_mtu variable under [DEFAULT]:

    [DEFAULT]
    ...
    advertise_mtu = False #remove this
  2. Remove the dhcp-option-force=26,1400 line from ~/openstack/my_cloud/config/neutron/dnsmasq-neutron.conf.j2.

  3. If SUSE OpenStack Cloud has already been deployed, follow the remaining steps, otherwise follow the normal deployment procedures.

  4. Commit your changes:

    ardana > git add -A
    ardana > git commit -m "your commit message goes here in quotes"
  5. Run the configuration processor:

    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Run ready deployment:

    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the network_interface-reconfigure.yml playbook, changing directories first:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts network_interface-reconfigure.yml
  8. Run the neutron-reconfigure.yml playbook:

    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
Important

If you are upgrading an existing deployment, avoid creating an MTU mismatch between the network interfaces of preexisting VMs and those of VMs created after the upgrade. If you do have an MTU mismatch, the new VMs (whose interfaces have an MTU of 1500 minus the underlay protocol overhead) will not have L2 connectivity with preexisting VMs (which have an MTU of 1400 due to dhcp-option-force).

10.4.12 Improve Network Performance with Isolated Metadata Settings

In SUSE OpenStack Cloud, neutron currently sets enable_isolated_metadata = True by default in dhcp_agent.ini because several services require isolated networks (neutron networks without a router). It also sets force_metadata = True if DVR is enabled to improve the scalability on large environments with a high churn rate. However, this has the effect of spawning a neutron-ns-metadata-proxy process on one of the controller nodes for every active neutron network.

In environments that create many neutron networks, these extra neutron-ns-metadata-proxy processes can quickly eat up a lot of memory on the controllers, which does not scale up well.

For deployments that do not require isolated metadata (that is, they do not require the Platform Services and will always create networks with an attached router) and do not have a high churn rate, you can set enable_isolated_metadata = False and force_metadata = False in dhcp_agent.ini to reduce neutron memory usage on controllers, allowing a greater number of active neutron networks.

Note that the dhcp_agent.ini.j2 template is found in ~/openstack/my_cloud/config/neutron on the Cloud Lifecycle Manager node. The edit can be made there and the standard deployment can be run if this is install time. In a deployed cloud, run the neutron reconfiguration procedure outlined here:

  1. First check out the site branch:

    ardana > cd ~/openstack/my_cloud/config/neutron
    ardana > git checkout site
  2. Edit the dhcp_agent.ini.j2 file to change the enable_isolated_metadata = {{ neutron_enable_isolated_metadata }} and force_metadata = {{ router_distributed }} lines in the [DEFAULT] section to read:

    enable_isolated_metadata = False
    force_metadata = False
  3. Commit the file:

    ardana > git add -A
    ardana > git commit -m "your commit message goes here in quotes"
  4. Run the ready-deployment.yml playbook from ~/openstack/ardana/ansible:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  5. Then run the neutron-reconfigure.yml playbook, changing directories first:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

10.4.13 Moving from DVR deployments to non_DVR

If you have an older deployment of SUSE OpenStack Cloud which is using DVR as a default and you are attempting to move to non_DVR, follow these steps:

  1. Remove all your existing DVR routers and their workloads. Make sure to remove interfaces, floating IPs, and gateways, if applicable.

    ardana > openstack router remove subnet ROUTER-NAME SUBNET-NAME/SUBNET-ID
    ardana > openstack floating ip unset --port FLOATINGIP-ID
    ardana > openstack router unset --external-gateway ROUTER-NAME
  2. Then delete the router.

    ardana > openstack router delete ROUTER-NAME
  3. Before you create any non_DVR router, make sure that no L3 agents or metadata agents are running on any compute host. Run the command openstack network agent list to see whether any neutron-l3-agent is running on a compute host in your deployment.

    You must disable neutron-l3-agent and neutron-metadata-agent on every compute host by running the following commands:

    ardana > openstack network agent list
    +--------------------------------------+----------------------+--------------------------+-------------------+-------+----------------+---------------------------+
    | id                                   | agent_type           | host                     | availability_zone | alive | admin_state_up | binary                    |
    +--------------------------------------+----------------------+--------------------------+-------------------+-------+----------------+---------------------------+
    | 810f0ae7-63aa-4ee3-952d-69837b4b2fe4 | L3 agent             | ardana-cp1-comp0001-mgmt | nova              | :-)   | True           | neutron-l3-agent          |
    | 89ac17ba-2f43-428a-98fa-b3698646543d | Metadata agent       | ardana-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-metadata-agent    |
    | f602edce-1d2a-4c8a-ba56-fa41103d4e17 | Open vSwitch agent   | ardana-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-openvswitch-agent |
    ...
    +--------------------------------------+----------------------+--------------------------+-------------------+-------+----------------+---------------------------+
    
    ardana > openstack network agent set --disable 810f0ae7-63aa-4ee3-952d-69837b4b2fe4
    Updated agent: 810f0ae7-63aa-4ee3-952d-69837b4b2fe4
    
    ardana > openstack network agent set --disable 89ac17ba-2f43-428a-98fa-b3698646543d
    Updated agent: 89ac17ba-2f43-428a-98fa-b3698646543d
    Note

    Only L3 and Metadata agents were disabled.

  4. Once L3 and metadata neutron agents are stopped, follow steps 1 through 7 in the document Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 12 “Alternative Configurations”, Section 12.2 “Configuring SUSE OpenStack Cloud without DVR” and then run the neutron-reconfigure.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
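
On deployments with many compute hosts, picking the agents to disable in step 3 by hand is error-prone. The following sketch shows one way to select them from a list of agent records. It assumes records keyed like the column headers of the table in step 3 (id, host, binary); the keys in real JSON output of the CLI may differ. The function name agents_to_disable and the controller entry in the sample data are hypothetical.

```python
# Sample records shaped like the table in step 3.
# The last entry is a made-up controller agent, included for contrast.
SAMPLE_AGENTS = [
    {"id": "810f0ae7-63aa-4ee3-952d-69837b4b2fe4",
     "host": "ardana-cp1-comp0001-mgmt", "binary": "neutron-l3-agent"},
    {"id": "89ac17ba-2f43-428a-98fa-b3698646543d",
     "host": "ardana-cp1-comp0001-mgmt", "binary": "neutron-metadata-agent"},
    {"id": "f602edce-1d2a-4c8a-ba56-fa41103d4e17",
     "host": "ardana-cp1-comp0001-mgmt", "binary": "neutron-openvswitch-agent"},
    {"id": "HYPOTHETICAL-CONTROLLER-AGENT-ID",
     "host": "HYPOTHETICAL-CONTROLLER-HOST", "binary": "neutron-l3-agent"},
]

# Only these two agent types are disabled on compute hosts.
BINARIES_TO_DISABLE = {"neutron-l3-agent", "neutron-metadata-agent"}


def agents_to_disable(agents, compute_hosts):
    """Return the IDs of L3 and metadata agents running on compute hosts."""
    return [
        agent["id"]
        for agent in agents
        if agent["host"] in compute_hosts
        and agent["binary"] in BINARIES_TO_DISABLE
    ]


# Print the commands rather than running them, so they can be reviewed first.
for agent_id in agents_to_disable(SAMPLE_AGENTS, {"ardana-cp1-comp0001-mgmt"}):
    print("openstack network agent set --disable", agent_id)
```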

10.4.14 OVS-DPDK Support

SUSE OpenStack Cloud uses a version of Open vSwitch (OVS) that is built with the Data Plane Development Kit (DPDK) and includes a QEMU hypervisor which supports vhost-user.

The OVS-DPDK package modifies the OVS fast path, which is normally performed in kernel space, and allows it to run in userspace so there is no context switch to the kernel for processing network packets.

The EAL component of DPDK supports mapping the Network Interface Card (NIC) registers directly into userspace. The DPDK provides a Poll Mode Driver (PMD) that can access the NIC hardware from userspace and uses polling instead of interrupts to avoid the user to kernel transition.

The PMD maps the shared address space of the VM that is provided by the vhost-user capability of QEMU. The vhost-user mode causes neutron to create a Unix domain socket that allows communication between the PMD and QEMU. The PMD uses this socket to acquire the file descriptors to the pre-allocated VM memory. This allows the PMD to directly access the VM memory space and perform a fast zero-copy of network packets directly into and out of the VM's virtio_net vring.

This yields performance improvements in the time it takes to process network packets.

10.4.14.1 Usage considerations

The target for a DPDK Open vSwitch is VM performance and VMs only run on compute nodes so the following considerations are compute node specific.

  1. In order to use DPDK with VMs, hugepages must be enabled; please see Section 10.4.14.3, “Configuring Hugepages for DPDK in Networks”. The memory to be used must be allocated at boot time so you must know beforehand how many VMs will be scheduled on a node. Also, for NUMA considerations, you want those hugepages on the same NUMA node as the NIC. A VM maps its entire address space into a hugepage.

  2. For maximum performance you must reserve logical cores for DPDK poll mode driver (PMD) usage and for hypervisor (QEMU) usage. This keeps the Linux kernel from scheduling processes on those cores. The PMD threads will go to 100% CPU utilization because they poll the hardware instead of using interrupts. There will be at least 2 cores dedicated to PMD threads. Each VM will have a core dedicated to it, although VMs can share cores at the cost of lower performance.

  3. VMs can use the virtio_net or the virtio_pmd drivers. There is also a PMD for an emulated e1000.

  4. Only VMs that use hugepages can be successfully launched on a DPDK-enabled NIC. If there is a need to support both DPDK and non-DPDK-based VMs, an additional port managed by the Linux kernel must exist.

10.4.14.2 For more information

See the following topics for more information:

10.4.14.3 Configuring Hugepages for DPDK in Networks

To take advantage of DPDK and its network performance enhancements, enable hugepages first.

With hugepages, physical RAM is reserved at boot time and dedicated to a virtual machine. Only that virtual machine and Open vSwitch can use this specifically allocated RAM. The host OS cannot access it. This memory is contiguous, and because of its larger size, reduces the number of entries in the memory map and number of times it must be read.

The hugepage reservation is made in /etc/default/grub, but this is handled by the Cloud Lifecycle Manager.

In addition to hugepages, CPU isolation is required to use DPDK. This is achieved with the 'isolcpus' kernel parameter in /etc/default/grub, but this is also managed by the Cloud Lifecycle Manager using a new input model file.

The two new input model files introduced with this release to help you configure the necessary settings and persist them are:

  • memory_models.yml (for hugepages)

  • cpu_models.yml (for CPU isolation)

10.4.14.3.1 memory_models.yml

In this file you set your huge page size along with the number of such huge-page allocations.

---
  product:
    version: 2

  memory-models:
    - name: COMPUTE-MEMORY-NUMA
      default-huge-page-size: 1G
      huge-pages:
        - size: 1G
          count: 24
          numa-node: 0
        - size: 1G
          count: 24
          numa-node: 1
        - size: 1G
          count: 48
10.4.14.3.2 cpu_models.yml
---
  product:
    version: 2

  cpu-models:

    - name: COMPUTE-CPU
      assignments:
       - components:
           - nova-compute-kvm
         cpu:
           - processor-ids: 3-5,12-17
             role: vm

       - components:
           - openvswitch
         cpu:
           - processor-ids: 0
             role: eal
           - processor-ids: 1-2
             role: pmd
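
Before committing a CPU model, it can be useful to check that no logical core is assigned to more than one role. The text does not state this as a requirement, so treat it as an assumed sanity check. The sketch below expands the processor-ids strings from the example above; the function names are hypothetical.

```python
def expand_ids(spec):
    """Expand a processor-ids value such as '3-5,12-17' into a set of ints."""
    ids = set()
    # str() covers values like 0 that YAML would load as an integer.
    for part in str(spec).split(","):
        low, _, high = part.strip().partition("-")
        ids.update(range(int(low), int(high or low) + 1))
    return ids


def overlapping_roles(roles):
    """Return (cpu, first_role, second_role) for every doubly assigned core."""
    seen = {}
    clashes = []
    for role, spec in roles.items():
        for cpu in sorted(expand_ids(spec)):
            if cpu in seen:
                clashes.append((cpu, seen[cpu], role))
            else:
                seen[cpu] = role
    return clashes


# Values from the COMPUTE-CPU example above.
ROLES = {"vm": "3-5,12-17", "eal": 0, "pmd": "1-2"}

print(sorted(expand_ids(ROLES["vm"])))  # [3, 4, 5, 12, 13, 14, 15, 16, 17]
print(overlapping_roles(ROLES))         # []
```
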
10.4.14.3.3 NUMA memory allocation

As mentioned above, the memory used for hugepages is locked down at boot time by an entry in /etc/default/grub. As an admin, you can specify in the input model how to arrange this memory on NUMA nodes. It can be spread across NUMA nodes or you can specify where you want it. For example, if you have only one NIC, you would probably want all the hugepages memory to be on the NUMA node closest to that NIC.

If you do not specify the numa-node settings in the memory_models.yml input model file and use only the last entry indicating "size: 1G" and "count: 48" then this memory is spread evenly across all NUMA nodes.

Also note that the hugepage service runs once at boot time and then goes to an inactive state so you should not expect to see it running. If you decide to make changes to the NUMA memory allocation, you will need to reboot the compute node for the changes to take effect.
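
The two ways of arranging hugepages described above can be compared with a short sketch. It takes huge-pages entries shaped like those in memory_models.yml and totals the page count per NUMA node. The even spread for entries without numa-node follows the text; how a remainder would be distributed is an assumption, and the function name is hypothetical.

```python
def pages_per_numa_node(huge_pages, numa_nodes):
    """Total hugepage count per NUMA node for a list of huge-pages entries."""
    totals = {node: 0 for node in range(numa_nodes)}
    for entry in huge_pages:
        if "numa-node" in entry:
            # Entry is pinned to one NUMA node.
            totals[entry["numa-node"]] += entry["count"]
        else:
            # Entry is spread evenly; giving any remainder to the
            # lowest-numbered nodes is an assumption of this sketch.
            share, remainder = divmod(entry["count"], numa_nodes)
            for node in totals:
                totals[node] += share + (1 if node < remainder else 0)
    return totals


pinned = [
    {"size": "1G", "count": 24, "numa-node": 0},
    {"size": "1G", "count": 24, "numa-node": 1},
]
spread = [{"size": "1G", "count": 48}]

print(pages_per_numa_node(pinned, 2))  # {0: 24, 1: 24}
print(pages_per_numa_node(spread, 2))  # {0: 24, 1: 24}
```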

10.4.14.4 DPDK Setup for Networking

10.4.14.4.1 Hardware requirements
  • Intel-based compute node. DPDK is not available on AMD-based systems.

  • The following BIOS settings must be enabled for DL360 Gen9:

    1. Virtualization Technology

    2. Intel(R) VT-d

    3. PCI-PT (Also see Section 10.4.15.14, “Enabling PCI-PT on HPE DL360 Gen 9 Servers”)

  • Need adequate host memory to allow for hugepages. The examples below use 1G hugepages for the VMs

10.4.14.4.2 Limitations
  • DPDK is supported on SLES only.

  • Applies to SUSE OpenStack Cloud 9 only.

  • Tenant network can be untagged vlan or untagged vxlan

  • DPDK port names must be of the form 'dpdk<portid>' where port id is sequential and starts at 0

  • No support for converting DPDK ports to non DPDK ports without rebooting compute node.

  • No security group support, need userspace conntrack.

  • No jumbo frame support.

10.4.14.4.3 Setup instructions

These setup instructions and example model are for a three-host system. There is one controller with Cloud Lifecycle Manager in cloud control plane and two compute hosts.

  1. After the initial run of site.yml, all compute nodes must be rebooted to pick up the grub changes for hugepages and isolcpus.

  2. Changes to non-uniform memory access (NUMA) memory, isolcpus, or network devices must be followed by a reboot of the compute nodes.

  3. Run sudo reboot to pick up the libvirt change and the hugepage/isolcpus grub changes:

    tux > sudo reboot
  4. Use the bash script below to configure nova aggregates, neutron networks, a new flavor, and so on. The script then spins up two VMs.

VM spin-up instructions

Before running the spin-up script, you need a copy of the cirros image on your Cloud Lifecycle Manager node. You can manually scp a copy of the cirros image to the system, or you can download it locally with wget:

ardana > wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img

Save the following shell script in the home directory and run it. This should spin up two VMs, one on each compute node.

Warning

Make sure to change all network-specific information in the script to match your environment.

#!/usr/bin/env bash

source service.osrc

######## register glance image
openstack image create --container-format bare --disk-format qcow2 --file ~/cirros-0.3.4-x86_64-disk.img cirros

####### create nova aggregate and flavor for dpdk

MI_NAME=dpdk

openstack aggregate create --zone nova $MI_NAME
openstack aggregate add host $MI_NAME openstack-cp-comp0001-mgmt
openstack aggregate add host $MI_NAME openstack-cp-comp0002-mgmt
openstack aggregate set --property pinned=true $MI_NAME

# flavor ID 6, 1024 MB RAM, 20 GB disk, 1 vCPU
openstack flavor create --id 6 --ram 1024 --disk 20 --vcpus 1 $MI_NAME
openstack flavor set --property hw:cpu_policy=dedicated $MI_NAME
openstack flavor set --property aggregate_instance_extra_specs:pinned=true $MI_NAME
openstack flavor set --property hw:mem_page_size=1048576 $MI_NAME

######## sec groups NOTE: no sec groups supported on DPDK.  This is in case we do non-DPDK compute hosts.
openstack security group rule create --protocol tcp --dst-port 22:22 --remote-ip 0.0.0.0/0 default
openstack security group rule create --protocol icmp --remote-ip 0.0.0.0/0 default

########  nova keys
openstack keypair create mykey >mykey.pem
chmod 400 mykey.pem

######## create neutron external network
openstack --os-interface internal network create --external ext-net
# ext-subnet is an example subnet name; change it to suit your environment
openstack subnet create --network ext-net --subnet-range 10.231.0.0/19 --gateway 10.231.0.1 --ip-version 4 --no-dhcp --allocation-pool start=10.231.17.0,end=10.231.17.255 ext-subnet

########  neutron network
openstack network create mynet1
openstack subnet create --network mynet1 --subnet-range 10.1.1.0/24 mysubnet1
openstack router create myrouter1
openstack router add subnet myrouter1 mysubnet1
openstack router set --external-gateway ext-net myrouter1
export MYNET=$(openstack network list | grep mynet | awk '{print $2}')

######## spin up 2 VMs, 1 on each compute
openstack server create --image cirros --nic net-id=${MYNET} --key-name mykey --flavor dpdk --availability-zone nova:openstack-cp-comp0001-mgmt vm1
openstack server create --image cirros --nic net-id=${MYNET} --key-name mykey --flavor dpdk --availability-zone nova:openstack-cp-comp0002-mgmt vm2

######## create floating ip and attach to instance
export MYFIP1=$(openstack floating ip create ext-net -f value -c floating_ip_address)
openstack server add floating ip vm1 ${MYFIP1}

export MYFIP2=$(openstack floating ip create ext-net -f value -c floating_ip_address)
openstack server add floating ip vm2 ${MYFIP2}

openstack server list

10.4.14.5 DPDK Configurations

10.4.14.5.1 Base configuration

The following is specific to DL360 Gen9 and BIOS configuration as detailed in Section 10.4.14.4, “DPDK Setup for Networking”.

  • EAL cores - 1, isolate: False in cpu-models

  • PMD cores - 1 per NIC port

  • Hugepages - 1G per PMD thread

  • Memory channels - 4

  • Global rx queues - based on needs

10.4.14.5.2 Performance considerations common to all NIC types

Compute host core frequency

Host CPUs should be running at maximum performance. The following script sets the CPU frequency governor accordingly. Note that in this case there are 24 cores; modify the range to fit your environment. For an HP DL360 Gen9, the BIOS should be configured to use "OS Control Mode", which can be found on the iLO Power Settings page.

for i in `seq 0 23`; do echo "performance" > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done

IO non-posted prefetch

The DL360 Gen9 should have the IO non-posted prefetch disabled. Experimental evidence shows this yields an additional 6-8% performance boost.

10.4.14.5.3 Multiqueue configuration

In order to use multiqueue, a property must be applied to the glance image and a setting inside the resulting VM must be applied. In this example we create a 4 vCPU flavor for DPDK using 1G hugepages.

MI_NAME=dpdk

openstack aggregate create --zone nova $MI_NAME
openstack aggregate add host $MI_NAME openstack-cp-comp0001-mgmt
openstack aggregate add host $MI_NAME openstack-cp-comp0002-mgmt
openstack aggregate set --property pinned=true $MI_NAME

openstack flavor create --id 6 --ram 1024 --disk 20 --vcpus 4 $MI_NAME
openstack flavor set --property hw:cpu_policy=dedicated $MI_NAME
openstack flavor set --property aggregate_instance_extra_specs:pinned=true $MI_NAME
openstack flavor set --property hw:mem_page_size=1048576 $MI_NAME

Then set the hw_vif_multiqueue_enabled property on the glance image:

ardana > openstack image set --property hw_vif_multiqueue_enabled=true IMAGE_UUID

Once the VM is booted using the flavor above, inside the VM, choose the number of combined rx and tx queues to be equal to the number of vCPUs

tux > sudo ethtool -L eth0 combined 4

On the hypervisor you can verify that multiqueue has been properly set by looking at the qemu process

-netdev type=vhost-user,id=hostnet0,chardev=charnet0,queues=4 -device virtio-net-pci,mq=on,vectors=10,

Here you can see that mq=on and vectors=10. The formula for vectors is 2*num_queues+2, so 4 queues give 2*4+2 = 10.
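
As a quick check when verifying the qemu command line for other queue counts, the formula can be written as a one-line function. The function name is hypothetical.

```python
def virtio_vectors(num_queues):
    """Expected 'vectors' value for a multiqueue virtio-net device."""
    return 2 * num_queues + 2


print(virtio_vectors(4))  # 10, matching the qemu process output above
```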

10.4.14.6 Troubleshooting DPDK

10.4.14.6.1 Hardware configuration

Because there are several variations of hardware, it is up to you to verify that the hardware is configured properly.

  • Only Intel based compute nodes are supported. There is no DPDK available for AMD-based CPUs.

  • PCI-PT must be enabled for the NIC that will be used with DPDK.

  • When using Intel Niantic and the igb_uio driver, VT-d must be enabled in the BIOS.

  • For DL360 Gen9 systems, check the BIOS shared-memory setting; see Section 10.4.15.14, “Enabling PCI-PT on HPE DL360 Gen 9 Servers”.

  • Adequate memory must be available for hugepages; see Section 10.4.14.3, “Configuring Hugepages for DPDK in Networks”.

  • Hyper-threading can be enabled but is not required for base functionality.

  • Determine the PCI slot that the DPDK NIC(s) are installed in to determine the associated NUMA node.

  • Only the Intel Haswell, Broadwell, and Skylake microarchitectures are supported. Intel Sandy Bridge is not supported.

10.4.14.6.2 System configuration

  • Only SLES12-SP4 compute nodes are supported.

  • If a NIC port is used with PCI-PT, SRIOV-only, or PCI-PT+SRIOV, then it cannot be used with DPDK. They are mutually exclusive. This is because DPDK depends on an OvS bridge, which does not exist if you use any combination of PCI-PT and SRIOV. You can use DPDK, SRIOV-only, and PCI-PT on different interfaces of the same server.

  • There is an association between the PCI slot for the NIC and a NUMA node. Make sure to use logical CPU cores that are on the NUMA node associated with the NIC. Use the following to determine which CPUs are on which NUMA node.

    ardana > lscpu
    
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                48
    On-line CPU(s) list:   0-47
    Thread(s) per core:    2
    Core(s) per socket:    12
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 63
    Model name:            Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
    Stepping:              2
    CPU MHz:               1200.000
    CPU max MHz:           1800.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              3597.06
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              30720K
    NUMA node0 CPU(s):     0-11,24-35
    NUMA node1 CPU(s):     12-23,36-47
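
To pick out the CPU list of a single NUMA node from lscpu output like the above, a small filter can help. This is a minimal sketch; numa_node_cpus is a hypothetical helper name, and it assumes the "NUMA nodeN CPU(s):" line format shown in the sample output.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU list of one NUMA node from lscpu output
# read on standard input. $1 is the NUMA node number.
numa_node_cpus() {
    awk -F: -v node="$1" '
    {
        key = $1; val = $2
        # Strip padding around the key and the value.
        gsub(/^[ \t]+|[ \t]+$/, "", key)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        if (key == ("NUMA node" node " CPU(s)")) print val
    }'
}

# On a compute node, usage would look like:
#   lscpu | numa_node_cpus 0
```

With the sample output above, node 1 yields 12-23,36-47; these are the logical cores to use when the DPDK NIC is attached to NUMA node 1.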

10.4.14.6.3 Input model configuration

  • If you do not specify a driver for a DPDK device, igb_uio is selected by default.

  • DPDK devices must be named dpdk<port-id> where the port-id starts at 0 and increments sequentially.

  • Tenant networks supported are untagged VXLAN and VLAN.

  • Jumbo Frames MTU is not supported with DPDK.

  • Sample VXLAN model

  • Sample VLAN model

10.4.14.6.4 Reboot requirements

A compute node must be rebooted in the following situations:

  1. After the initial site.yml play on a new OpenStack environment

  2. Changes to an existing OpenStack environment that modify the /etc/default/grub file, such as

    • hugepage allocations

    • CPU isolation

    • iommu changes

  3. Changes to a NIC port usage type, such as

    • moving from DPDK to any combination of PCI-PT and SRIOV

    • moving from DPDK to kernel based eth driver

10.4.15 SR-IOV and PCI Passthrough Support

SUSE OpenStack Cloud supports both single-root I/O virtualization (SR-IOV) and PCI passthrough (PCIPT). Both technologies improve network I/O, decrease latency, and reduce processor overhead.

10.4.15.1 SR-IOV

A PCI-SIG Single Root I/O Virtualization and Sharing (SR-IOV) Ethernet interface is a physical PCI Ethernet NIC that implements hardware-based virtualization mechanisms to expose multiple virtual network interfaces that can be used by one or more virtual machines simultaneously. With SR-IOV based NICs, the traditional virtual bridge is no longer required. Each SR-IOV port is associated with a virtual function (VF).

When compared with a PCI Passthrough Ethernet interface, an SR-IOV Ethernet interface:

  • Provides benefits similar to those of a PCI Passthrough Ethernet interface, including lower latency packet processing.

  • Scales up more easily in a virtualized environment by providing multiple VFs that can be attached to multiple virtual machine interfaces.

  • Shares the same limitations, including the lack of support for LAG, QoS, ACL, and live migration.

  • Has the same requirements regarding the VLAN configuration of the access switches.

The process for configuring SR-IOV includes creating a VLAN provider network and subnet, then attaching VMs to that network.

10.4.15.2 PCI passthrough Ethernet interfaces

A passthrough Ethernet interface is a physical PCI Ethernet NIC on a compute node to which a virtual machine is granted direct access. PCI passthrough allows a VM to have direct access to the hardware without being brokered by the hypervisor. This minimizes packet processing delays but at the same time demands special operational considerations. For all purposes, a PCI passthrough interface behaves as if it were physically attached to the virtual machine. Therefore any potential throughput limitations coming from the virtualized environment, such as the ones introduced by internal copying of data buffers, are eliminated. However, by bypassing the virtualized environment, the use of PCI passthrough Ethernet devices introduces several restrictions that must be taken into consideration. They include:

  • no support for LAG, QoS, ACL, or host interface monitoring

  • no support for live migration

  • no access to the compute node's OVS switch

A passthrough interface bypasses the compute node's OVS switch completely, and is attached instead directly to the provider network's access switch. Therefore, proper routing of traffic to connect the passthrough interface to a particular tenant network depends entirely on the VLAN tagging options configured on both the passthrough interface and the access port on the switch (TOR).

The access switch routes incoming traffic based on a VLAN ID, which ultimately determines the tenant network to which the traffic belongs. The VLAN ID is either explicit, as found in incoming tagged packets, or implicit, as defined by the access port's default VLAN ID when the incoming packets are untagged. In both cases the access switch must be configured to process the proper VLAN ID, which therefore has to be known in advance.

10.4.15.3 Leveraging PCI Passthrough

Two parts are necessary to leverage PCI passthrough on a SUSE OpenStack Cloud 9 Compute Node: preparing the Compute Node, and preparing nova and glance.

  1. Preparing the Compute Node

    1. There should be no kernel drivers or binaries with direct access to the PCI device. If there are kernel modules, they should be blacklisted.

      For example, it is common to have a nouveau driver from when the node was installed. This driver is a graphics driver for Nvidia-based GPUs. It must be blacklisted as shown in this example.

      ardana > echo 'blacklist nouveau' | sudo tee -a /etc/modprobe.d/nouveau-default.conf

      The file location and its contents are important; the name of the file is your choice. Other drivers can be blacklisted in the same manner, possibly including Nvidia drivers.

    2. On the host, iommu_groups is necessary and may already be enabled. To check if IOMMU is enabled:

      root # virt-host-validate
      .....
      QEMU: Checking if IOMMU is enabled by kernel
      : WARN (IOMMU appears to be disabled in kernel. Add intel_iommu=on to kernel cmdline arguments)
      .....

      To modify the kernel cmdline as suggested in the warning, edit the file /etc/default/grub and append intel_iommu=on to the GRUB_CMDLINE_LINUX_DEFAULT variable. Then run update-bootloader.

      A reboot will be required for iommu_groups to be enabled.

    3. After the reboot, check that IOMMU is enabled:

      root # virt-host-validate
      .....
      QEMU: Checking if IOMMU is enabled by kernel
      : PASS
      .....
    4. Confirm IOMMU groups are available by finding the group associated with your PCI device (for example Nvidia GPU):

      ardana > lspci -nn | grep -i nvidia
      08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT218 [NVS 300] [10de:10d8] (rev a2)
      08:00.1 Audio device [0403]: NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)

      In this example, 08:00.0 and 08:00.1 are addresses of the PCI device. The vendorID is 10de. The productIDs are 10d8 and 0be3.

    5. Confirm that the devices are available for passthrough:

      ardana > ls -ld /sys/kernel/iommu_groups/*/devices/*08:00.?/
      drwxr-xr-x 3 root root 0 Feb 14 13:05 /sys/kernel/iommu_groups/20/devices/0000:08:00.0/
      drwxr-xr-x 3 root root 0 Feb 19 16:09 /sys/kernel/iommu_groups/20/devices/0000:08:00.1/
      Note

      With PCI passthrough, only an entire IOMMU group can be passed. Parts of the group cannot be passed. In this example, the IOMMU group is 20.

  2. Preparing nova and glance for passthrough

    Information about configuring nova and glance is available in the documentation at https://docs.openstack.org/nova/rocky/admin/pci-passthrough.html. Both nova-compute and nova-scheduler must be configured.
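
Because only an entire IOMMU group can be passed through (see the note in step 1), it can help to list every device that shares a group with the device you intend to pass. The following is a minimal sketch; iommu_group_members is a hypothetical helper name, and the second argument exists only so the function can be pointed at a test directory instead of /sys.

```shell
#!/bin/sh
# Hypothetical helper: list every PCI device that shares an IOMMU group with
# the given device. $1 is the full PCI address (for example 0000:08:00.0);
# $2 is the sysfs root, /sys by default.
iommu_group_members() {
    sysfs=${2:-/sys}
    for dev in "$sysfs"/kernel/iommu_groups/*/devices/"$1"; do
        # Skip the unexpanded pattern when the device is in no group.
        [ -e "$dev" ] || continue
        ls "$(dirname "$dev")"
    done
}

# On a compute node, usage would look like:
#   iommu_group_members 0000:08:00.0
```

For the example host above, this would be expected to list both 0000:08:00.0 and 0000:08:00.1, since both devices are in IOMMU group 20.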

10.4.15.4 Supported Intel 82599 Devices

Table 10.1: Intel 82599 devices supported with SRIOV and PCIPT

Vendor             Device  Title
Intel Corporation  10f8    82599 10 Gigabit Dual Port Backplane Connection
Intel Corporation  10f9    82599 10 Gigabit Dual Port Network Connection
Intel Corporation  10fb    82599ES 10-Gigabit SFI/SFP+ Network Connection
Intel Corporation  10fc    82599 10 Gigabit Dual Port Network Connection

10.4.15.5 SRIOV PCIPT configuration

If you plan to take advantage of SR-IOV support in SUSE OpenStack Cloud, plan in advance to meet the following requirements:

  1. Use one of the supported NICs:

    • HP Ethernet 10Gb 2-port 560FLR-SFP+ Adapter (Intel Niantic). Product part number: 665243-B21 -- Same part number for the following card options:

      • FlexLOM card

      • PCI slot adapter card

  2. Identify the NIC ports to be used for PCI Passthrough devices and SRIOV devices from each compute node

  3. Ensure that:

    • SRIOV is enabled in the BIOS

    • HP Shared memory is disabled in the BIOS on the compute nodes.

    • The Intel boot agent is disabled on the compute nodes (see Section 10.4.15.11, “Intel bootutils” for how to do this).

    Note

    Because of Intel driver limitations, you cannot use a NIC port as an SRIOV NIC as well as a physical NIC. Using the physical function to carry the normal tenant traffic through the OVS bridge at the same time as assigning the VFs from the same NIC device as passthrough to the guest VM is not supported.

If the above prerequisites are met, then SR-IOV or PCIPT can be reconfigured at any time. There is no need to do it at install time.

10.4.15.6 Deployment use cases

The following are typical use cases that should cover your particular needs:

  1. A device on the host needs to be enabled for both PCI-passthrough and PCI-SRIOV during deployment. At run time, nova decides whether to use physical functions or virtual functions depending on the vnic_type of the port used for booting the VM.

  2. A device on the host needs to be configured only for PCI-passthrough.

  3. A device on the host needs to be configured only for PCI-SRIOV virtual functions.

10.4.15.7 Input model updates

SUSE OpenStack Cloud 9 provides various options for the user to configure the network for tenant VMs. These options have been enhanced to support SRIOV and PCIPT.

The Cloud Lifecycle Manager input model changes to support SRIOV and PCIPT are as follows. If you were familiar with the configuration settings previously, you will notice these changes.

net_interfaces.yml: This file defines the interface details of the nodes. In it, the following fields have been added under the compute node interface section:

Key          Value
sriov-only:  Indicates that only SR-IOV is enabled on the interface. Set this to true if you want to dedicate the NIC interface to SR-IOV functionality.
pci-pt:      When this value is set to true, PCIPT is enabled on the interface.
vf-count:    Indicates the number of VFs to be configured on a given interface.

In control_plane.yml under Compute resource, neutron-sriov-nic-agent has been added as a service component.

Under resources:

Key                  Value
name:                Compute
resource-prefix:     Comp
server-role:         COMPUTE-ROLE
allocation-policy:   Any
min-count:           0
service-components:  ntp-client
                     nova-compute
                     nova-compute-kvm
                     neutron-l3-agent
                     neutron-metadata-agent
                     neutron-openvswitch-agent
                     neutron-sriov-nic-agent*

nic_device_data.yml: This file was added to support SRIOV and PCIPT configuration details. It contains information about the specifics of a NIC, and is found at /usr/share/ardana/input-model/2.0/services/osconfig/nic_device_data.yml. The fields in this file are as follows.

  1. nic-device-types: The nic-device-types section contains the following key-value pairs:

    Key         Value
    name:       The name of the nic-device-type that will be referenced in nic_mappings.yml.
    family:     The name of the nic-device-family to be used with this nic-device-type.
    device-id:  Device ID as specified by the vendor for the particular NIC.
    type:       The value of this field can be simple-port or multi-port. If a single bus address is assigned to more than one NIC, the value is multi-port. If there is a one-to-one mapping between bus address and NIC, the value is simple-port.

  2. nic-device-families: The nic-device-families section contains the following key-value pairs:

    Key             Value
    name:           The name of the device family that can be used for reference in nic-device-types.
    vendor-id:      Vendor ID of the NIC.
    config-script:  A script file used to create the virtual functions (VF) on the Compute node.
    driver:         Indicates the NIC driver that needs to be used.
    vf-count-type:  This value can be either port or driver.
                    port: Indicates that the device supports per-port virtual function (VF) counts.
                    driver: Indicates that all ports using the same driver will be configured with the same number of VFs, whether or not the interface model specifies a vf-count attribute for the port. If two or more ports specify different vf-count values, the config processor errors out.
    max-vf-count:   The maximum number of VFs that can be configured on an interface, as defined by the vendor.

control_plane.yml: This file provides the information about the services to be run on a particular node. To support SR-IOV on a particular compute node, you must run neutron-sriov-nic-agent on that node.

Mapping the use cases to fields in the input model

Setting                                           vf-count       SR-IOV  PCIPT  OVS bridge  Can be NIC bonded  Use case
sriov-only: true                                  Mandatory      Yes     No     No          No                 Dedicated to SRIOV
pci-pt: true                                      Not specified  No      Yes    No          No                 Dedicated to PCI-PT
pci-pt: true                                      Specified      Yes     Yes    No          No                 PCI-PT or SRIOV
pci-pt and sriov-only keywords are not specified  Specified      Yes     No     Yes         No                 SRIOV with PF used by host
pci-pt and sriov-only keywords are not specified  Not specified  No      No     Yes         Yes                Traditional/Usual use case
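
The mapping in the table can be expressed as a small decision function. This is a sketch only; interface_use_case is a hypothetical helper name, not part of the Cloud Lifecycle Manager. The table does not cover setting both sriov-only and pci-pt, so the sketch rejects that combination instead of guessing.

```shell
#!/bin/sh
# Hypothetical helper mirroring the use case table. Arguments:
#   $1  sriov-only value: true, or empty when the keyword is not specified
#   $2  pci-pt value:     true, or empty when the keyword is not specified
#   $3  vf-count value:   a number, or empty when not specified
interface_use_case() {
    sriov_only=$1; pci_pt=$2; vf_count=$3
    if [ "$sriov_only" = true ] && [ "$pci_pt" = true ]; then
        echo "not covered by the table: sriov-only and pci-pt both set" >&2
        return 1
    fi
    if [ "$sriov_only" = true ]; then
        if [ -z "$vf_count" ]; then
            echo "invalid: vf-count is mandatory with sriov-only" >&2
            return 1
        fi
        echo "Dedicated to SRIOV"
    elif [ "$pci_pt" = true ]; then
        if [ -n "$vf_count" ]; then
            echo "PCI-PT or SRIOV"
        else
            echo "Dedicated to PCI-PT"
        fi
    elif [ -n "$vf_count" ]; then
        echo "SRIOV with PF used by host"
    else
        echo "Traditional/Usual use case"
    fi
}

interface_use_case "" true 6
```

The final call corresponds to an interface with pci-pt: true and vf-count: 6, and prints "PCI-PT or SRIOV".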

10.4.15.8 Mappings between nic_mappings.yml and net_interfaces.yml

The following diagram shows which fields in nic_mappings.yml map to corresponding fields in net_interfaces.yml:

Image

10.4.15.9 Example Use Cases for Intel

  1. nic-device-types and nic-device-families for the Intel 82599 with ixgbe as the driver:

    nic-device-types:
        - name: '8086:10fb'
          family: INTEL-82599
          device-id: '10fb'
          type: simple-port
    nic-device-families:
        # Niantic
        - name: INTEL-82599
          vendor-id: '8086'
          config-script: intel-82599.sh
          driver: ixgbe
          vf-count-type: port
          max-vf-count: 63
  2. net_interfaces.yml for the SRIOV-only use case:

    - name: COMPUTE-INTERFACES
      network-interfaces:
        - name: hed1
          device:
            name: hed1
            sriov-only: true
            vf-count: 6
          network-groups:
            - GUEST1
  3. net_interfaces.yml for the PCIPT-only use case:

    - name: COMPUTE-INTERFACES
      network-interfaces:
        - name: hed1
          device:
            name: hed1
            pci-pt: true
          network-groups:
            - GUEST1
  4. net_interfaces.yml for the SRIOV and PCIPT use case:

    - name: COMPUTE-INTERFACES
      network-interfaces:
        - name: hed1
          device:
            name: hed1
            pci-pt: true
            vf-count: 6
          network-groups:
            - GUEST1
  5. net_interfaces.yml for the SRIOV and normal virtio use case:

    - name: COMPUTE-INTERFACES
      network-interfaces:
        - name: hed1
          device:
            name: hed1
            vf-count: 6
          network-groups:
            - GUEST1
  6. net_interfaces.yml for PCI-PT (hed1 and hed4 refer to the DUAL ports of the PCI-PT NIC)

        - name: COMPUTE-PCI-INTERFACES
          network-interfaces:
          - name: hed3
            device:
              name: hed3
            network-groups:
              - MANAGEMENT
              - EXTERNAL-VM
            forced-network-groups:
              - EXTERNAL-API
          - name: hed1
            device:
              name: hed1
              pci-pt: true
            network-groups:
              - GUEST
          - name: hed4
            device:
              name: hed4
              pci-pt: true
            network-groups:
              - GUEST

10.4.15.10 Launching Virtual Machines

Provisioning a VM with an SR-IOV NIC is a two-step process.

  1. Create a neutron port with vnic_type = direct.

    ardana > openstack port create --network $net_id --vnic-type direct sriov_port
  2. Boot a VM with the created port-id.

    ardana > openstack server create --flavor m1.large --image opensuse --nic port-id=$port_id test-sriov

Provisioning a VM with a PCI-PT NIC is a two-step process.

  1. Create two neutron ports with vnic_type = direct-physical.

    ardana > openstack port create --network net1 --vnic-type direct-physical pci-port1
    ardana > openstack port create --network net1 --vnic-type direct-physical pci-port2
  2. Boot a VM with the created ports.

    ardana > openstack server create --flavor 4 --image opensuse --nic port-id=PCI_PORT1_PORT_ID \
    --nic port-id=PCI_PORT2_PORT_ID vm1-pci-passthrough

If a PCI-PT VM gets stuck (hangs) at boot time when using an Intel NIC, the boot agent should be disabled (see Section 10.4.15.11, “Intel bootutils”).
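
The two-step SR-IOV provisioning above can be combined so that the port ID is captured automatically. The following is a minimal sketch; boot_sriov_vm and the sriov_port_ name prefix are hypothetical choices for illustration. It relies on the standard OpenStack client output options -f value -c id to print only the ID of the new port.

```shell
#!/bin/sh
# Hypothetical wrapper: create an SR-IOV port and boot a VM on it.
# $1 network, $2 flavor, $3 image, $4 server name.
boot_sriov_vm() {
    # Step 1: create the port with vnic_type direct and keep only its ID.
    port_id=$(openstack port create --network "$1" --vnic-type direct \
        -f value -c id "sriov_port_$4") || return 1
    # Step 2: boot the VM with the created port.
    openstack server create --flavor "$2" --image "$3" \
        --nic port-id="$port_id" "$4"
}

# Usage would look like:
#   boot_sriov_vm net1 m1.large opensuse test-sriov
```

For PCI-PT, the same pattern would apply with --vnic-type direct-physical and one --nic argument per port.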

10.4.15.11 Intel bootutils

When Intel cards are used for PCI-PT, a tenant VM can get stuck at boot time. When this happens, download Intel bootutils and use it to disable the boot agent.

  1. Download Preboot.tar.gz from https://downloadcenter.intel.com/download/19186/Intel-Ethernet-Connections-Boot-Utility-Preboot-Images-and-EFI-Drivers

  2. Untar the Preboot.tar.gz on the compute node where the PCI-PT VM is to be hosted.

  3. Go to ~/APPS/BootUtil/Linux_x64

    ardana > cd ~/APPS/BootUtil/Linux_x64

    and run the following command:

    ardana > ./bootutil64e -BOOTENABLE disable -all
  4. Boot the PCI-PT VM; it should boot without getting stuck.

    Note

    Even though the VM console shows the VM getting stuck at PXE boot, this is not related to BIOS PXE settings.

10.4.15.12 Making input model changes and implementing PCI PT and SR-IOV

To implement the configuration you require, log into the Cloud Lifecycle Manager node and update the Cloud Lifecycle Manager model files to enable SR-IOV or PCIPT following the relevant use case explained above. You will need to edit the following:

  • net_interfaces.yml

  • nic_device_data.yml

  • control_plane.yml

To make the edits:

  1. Check out the site branch of the local git repository and change to the correct directory:

    ardana > git checkout site
    ardana > cd ~/openstack/my_cloud/definition/data/
  2. Open each file in vim or another editor and make the necessary changes. Save each file, then commit to the local git repository:

    ardana > git add -A
    ardana > git commit -m "your commit message goes here in quotes"
  3. Have the Cloud Lifecycle Manager enable your changes by running the necessary playbooks:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml
Note

After running the site.yml playbook above, you must reboot the compute nodes that are configured with Intel PCI devices.

Note

When a VM is running on an SRIOV port on a given compute node, reconfiguration is not supported.

You can set the number of virtual functions to be enabled on a compute node at install time, and you can update that number after deployment. If any VMs were spawned before you change the number of virtual functions, those VMs may lose connectivity. Therefore, do not reconfigure the virtual functions while any of them is in use by a tenant VM. Instead, delete or migrate all the VMs on that NIC before reconfiguring the number of virtual functions.
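
Before changing the number of virtual functions, it can be useful to record how many are currently configured on an interface. The sketch below reads the Linux kernel's sriov_numvfs and sriov_totalvfs sysfs attributes. vf_counts is a hypothetical helper name, and the second argument exists only so the function can be pointed at a test directory instead of /sys. It reports configured VFs, not whether a VM is using them.

```shell
#!/bin/sh
# Hypothetical helper: report the configured and maximum VF counts for an
# interface. $1 is the interface name; $2 is the sysfs root (default /sys).
vf_counts() {
    dev_dir="${2:-/sys}/class/net/$1/device"
    if [ ! -r "$dev_dir/sriov_numvfs" ]; then
        echo "$1: no SR-IOV capability found" >&2
        return 1
    fi
    echo "$1 numvfs=$(cat "$dev_dir/sriov_numvfs") totalvfs=$(cat "$dev_dir/sriov_totalvfs")"
}

# On a compute node, usage would look like:
#   vf_counts hed1
```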

10.4.15.13 Limitations

  • Security groups are not applicable for PCI-PT and SRIOV ports.

  • Live migration is not supported for VMs with PCI-PT and SRIOV ports.

  • Rate limiting (QoS) is not applicable on SRIOV and PCI-PT ports.

  • SRIOV/PCIPT is not supported for VXLAN networks.

  • DVR is not supported with SRIOV/PCIPT.

  • For Intel cards, the same NIC cannot be used for both SRIOV and normal VM boot.

  • Current upstream OpenStack code does not support hot plugging of SRIOV/PCIPT interfaces using the nova attach_interface command. See https://review.openstack.org/#/c/139910/ for more information.

  • The openstack port update command will not work when admin state is down.

  • On SLES Compute Nodes with dual-port PCI-PT NICs, both ports must always be passed to the VM. It is not possible to split the dual port and pass through just a single port.

10.4.15.14 Enabling PCI-PT on HPE DL360 Gen 9 Servers

The HPE DL360 Gen 9 and HPE ProLiant systems with Intel processors use a region of system memory for sideband communication of management information. The BIOS sets up Reserved Memory Region Reporting (RMRR) to report these memory regions and devices to the operating system. There is a conflict between the Linux kernel and RMRR which causes problems with PCI passthrough (PCI-PT), which is needed for IOMMU use by DPDK. Note that this does not affect SR-IOV.

In order to enable PCI-PT on the HPE DL360 Gen 9 you must have a version of firmware that supports setting this and you must change a BIOS setting.

To begin, get the latest firmware and install it on your compute nodes.

Once the firmware has been updated:

  1. Reboot the server and press F9 (system utilities) during POST (power on self test)

  2. Choose System Configuration

  3. Select the NIC for which you want to enable PCI-PT

  4. Choose Device Level Configuration

  5. Disable the shared memory feature in the BIOS.

  6. Save the changes and reboot the server

10.4.16 Setting up VLAN-Aware VMs

Creating a VM with a trunk port allows the VM to gain connectivity to one or more networks over the same virtual NIC (vNIC) through the use of VLAN interfaces in the guest VM. Connectivity to different networks can be added and removed dynamically through the use of subports. The network of the parent port is presented to the VM as the untagged VLAN, and the networks of the child ports are presented to the VM as tagged VLANs (the VIDs of which can be chosen arbitrarily as long as they are unique to that trunk). The VM sends and receives VLAN-tagged traffic over the subports, and neutron muxes/demuxes the traffic onto the subport's corresponding network. This is not to be confused with VLAN transparency, where a VM can pass VLAN-tagged traffic transparently across the network without interference from neutron. VLAN transparency is not supported.

10.4.16.1 Terminology

  • Trunk: a resource that logically represents a trunked vNIC and references a parent port.

  • Parent port: a neutron port that a Trunk is referenced to. Its network is presented as the untagged VLAN.

  • Subport: a resource that logically represents a tagged VLAN port on a Trunk. A Subport references a child port and consists of the <port>,<segmentation-type>,<segmentation-id> tuple. Currently only the vlan segmentation type is supported.

  • Child port: a neutron port that a Subport is referenced to. Its network is presented as a tagged VLAN based upon the segmentation-id used when creating/adding a Subport.

  • Legacy VM: a VM that does not use a trunk port.

  • Legacy port: a neutron port that is not used in a Trunk.

  • VLAN-aware VM: a VM that uses at least one trunk port.

10.4.16.2 Trunk CLI reference

Command               Action
network trunk create  Create a trunk.
network trunk delete  Delete a given trunk.
network trunk list    List all trunks.
network trunk show    Show information of a given trunk.
network trunk set     Add subports to a given trunk.
network subport list  List all subports for a given trunk.
network trunk unset   Remove subports from a given trunk.
network trunk set     Update trunk properties.

10.4.16.3 Enabling VLAN-aware VM capability

  1. Edit ~/openstack/my_cloud/config/neutron/neutron.conf.j2 to add the trunk service_plugin:

    service_plugins = {{ neutron_service_plugins }},trunk
  2. Edit ~/openstack/my_cloud/config/neutron/ml2_conf.ini.j2 to enable the noop firewall driver:

    [securitygroup]
    firewall_driver = neutron.agent.firewall.NoopFirewallDriver
    Note

    This is a manual configuration step because it must be made apparent that this step disables neutron security groups completely. The default SUSE OpenStack Cloud firewall_driver is neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver, which does not implement security groups for trunk ports. Optionally, the SUSE OpenStack Cloud default firewall_driver may still be used (this step can be skipped), which would provide security groups for legacy VMs but not for VLAN-aware VMs. However, this mixed environment is not recommended. For more information, see Section 10.4.16.6, “Firewall issues”.

  3. Commit the configuration changes:

    ardana > git add -A
    ardana > git commit -m "Enable vlan-aware VMs"
    ardana > cd ~/openstack/ardana/ansible/
  4. If this is an initial deployment, continue the rest of normal deployment process:

    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml
  5. If the cloud has already been deployed and this is a reconfiguration:

    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

10.4.16.4 Use Cases

Creating a trunk port

Assume that a number of neutron networks/subnets already exist: private, foo-net, and bar-net. This will create a trunk with two subports allocated to it. The parent port will be on the "private" network, while the two child ports will be on "foo-net" and "bar-net", respectively:

  1. Create a port that will function as the trunk's parent port:

    ardana > openstack port create --name trunkparent private
  2. Create ports that will function as the child ports to be used in subports:

    ardana > openstack port create --name subport1 foo-net
    ardana > openstack port create --name subport2 bar-net
  3. Create a trunk port using the openstack network trunk create command, passing the parent port created in step 1 and child ports created in step 2:

    ardana > openstack network trunk create --parent-port trunkparent --subport port=subport1,segmentation-type=vlan,segmentation-id=1 --subport port=subport2,segmentation-type=vlan,segmentation-id=2 mytrunk
    +-----------------+-----------------------------------------------------------------------------------------------+
    | Field           | Value                                                                                         |
    +-----------------+-----------------------------------------------------------------------------------------------+
    | admin_state_up  | UP                                                                                            |
    | created_at      | 2017-06-02T21:49:59Z                                                                          |
    | description     |                                                                                               |
    | id              | bd822ebd-33d5-423e-8731-dfe16dcebac2                                                          |
    | name            | mytrunk                                                                                       |
    | port_id         | 239f8807-be2e-4732-9de6-c64519f46358                                                          |
    | project_id      | f51610e1ac8941a9a0d08940f11ed9b9                                                              |
    | revision_number | 1                                                                                             |
    | status          | DOWN                                                                                          |
    | sub_ports       | port_id='9d25abcf-d8a4-4272-9436-75735d2d39dc', segmentation_id='1', segmentation_type='vlan' |
    |                 | port_id='e3c38cb2-0567-4501-9602-c7a78300461e', segmentation_id='2', segmentation_type='vlan' |
    | tenant_id       | f51610e1ac8941a9a0d08940f11ed9b9                                                              |
    | updated_at      | 2017-06-02T21:49:59Z                                                                          |
    +-----------------+-----------------------------------------------------------------------------------------------+
    
    ardana > openstack network subport list --trunk mytrunk
    +--------------------------------------+-------------------+-----------------+
    | Port                                 | Segmentation Type | Segmentation ID |
    +--------------------------------------+-------------------+-----------------+
    | 9d25abcf-d8a4-4272-9436-75735d2d39dc | vlan              |               1 |
    | e3c38cb2-0567-4501-9602-c7a78300461e | vlan              |               2 |
    +--------------------------------------+-------------------+-----------------+

    Optionally, a trunk may be created without subports (they can be added later):

    ardana > openstack network trunk create --parent-port trunkparent mytrunk
    +-----------------+--------------------------------------+
    | Field           | Value                                |
    +-----------------+--------------------------------------+
    | admin_state_up  | UP                                   |
    | created_at      | 2017-06-02T21:45:35Z                 |
    | description     |                                      |
    | id              | eb8a3c7d-9f0a-42db-b26a-ca15c2b38e6e |
    | name            | mytrunk                              |
    | port_id         | 239f8807-be2e-4732-9de6-c64519f46358 |
    | project_id      | f51610e1ac8941a9a0d08940f11ed9b9     |
    | revision_number | 1                                    |
    | status          | DOWN                                 |
    | sub_ports       |                                      |
    | tenant_id       | f51610e1ac8941a9a0d08940f11ed9b9     |
    | updated_at      | 2017-06-02T21:45:35Z                 |
    +-----------------+--------------------------------------+

    A port that is already bound (that is, already in use by a VM) cannot be upgraded to a trunk port. The port must be unbound to be eligible for use as a trunk's parent port. When adding subports to a trunk, the child ports must be unbound as well.

Checking a port's trunk details

Once a trunk has been created, its parent port will show the trunk_details attribute, which consists of the trunk_id and list of subport dictionaries:

ardana > openstack port show -F trunk_details trunkparent
+---------------+-------------------------------------------------------------------------------------+
| Field         | Value                                                                               |
+---------------+-------------------------------------------------------------------------------------+
| trunk_details | {"trunk_id": "bd822ebd-33d5-423e-8731-dfe16dcebac2", "sub_ports":                   |
|               | [{"segmentation_id": 2, "port_id": "e3c38cb2-0567-4501-9602-c7a78300461e",          |
|               | "segmentation_type": "vlan", "mac_address": "fa:16:3e:11:90:d2"},                   |
|               | {"segmentation_id": 1, "port_id": "9d25abcf-d8a4-4272-9436-75735d2d39dc",           |
|               | "segmentation_type": "vlan", "mac_address": "fa:16:3e:ff:de:73"}]}                  |
+---------------+-------------------------------------------------------------------------------------+

Ports that are not trunk parent ports will not have a trunk_details field:

ardana > openstack port show -F trunk_details subport1
need more than 0 values to unpack
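
The trunk_details value is JSON, so it can be processed by a script. The following is a minimal Python sketch, not part of the product tooling, that maps each VLAN ID to the MAC address of its subport. The function name subports_by_vlan is hypothetical, and the sample string is copied from the output shown above.

```python
import json

# Sample value copied from the trunk_details output shown above.
TRUNK_DETAILS = '''{"trunk_id": "bd822ebd-33d5-423e-8731-dfe16dcebac2", "sub_ports":
[{"segmentation_id": 2, "port_id": "e3c38cb2-0567-4501-9602-c7a78300461e",
"segmentation_type": "vlan", "mac_address": "fa:16:3e:11:90:d2"},
{"segmentation_id": 1, "port_id": "9d25abcf-d8a4-4272-9436-75735d2d39dc",
"segmentation_type": "vlan", "mac_address": "fa:16:3e:ff:de:73"}]}'''


def subports_by_vlan(trunk_details_json):
    """Return {segmentation_id: mac_address} for the VLAN subports of a trunk."""
    details = json.loads(trunk_details_json)
    return {
        sp["segmentation_id"]: sp["mac_address"]
        for sp in details.get("sub_ports", [])
        if sp.get("segmentation_type") == "vlan"
    }


if __name__ == "__main__":
    # Print one "VLAN ID, MAC address" pair per subport.
    for vlan_id, mac in sorted(subports_by_vlan(TRUNK_DETAILS).items()):
        print(vlan_id, mac)
```

The resulting mapping gives the VLAN ID and MAC address needed later when configuring VLAN interfaces inside the VM.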

Adding subports to a trunk

Assuming a trunk and a new child port have already been created, the openstack network trunk set command with the --subport option adds one or more subports to the trunk.

  1. Run openstack network trunk set

    ardana > openstack network trunk set --subport port=subport3,segmentation-type=vlan,segmentation-id=3 mytrunk
  2. Run openstack network subport list

    ardana > openstack network subport list --trunk mytrunk
    +--------------------------------------+-------------------+-----------------+
    | Port                                 | Segmentation Type | Segmentation ID |
    +--------------------------------------+-------------------+-----------------+
    | 9d25abcf-d8a4-4272-9436-75735d2d39dc | vlan              |               1 |
    | e3c38cb2-0567-4501-9602-c7a78300461e | vlan              |               2 |
    | bf958742-dbf9-467f-b889-9f8f2d6414ad | vlan              |               3 |
    +--------------------------------------+-------------------+-----------------+
Note

The --subport option may be repeated multiple times in order to add multiple subports at a time.
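
When several subports are added at once, the repeated --subport options can be generated rather than typed by hand. The sketch below is illustrative only: the helper name trunk_set_args and the port name subport4 are hypothetical, and the option syntax is taken from the command shown in step 1.

```python
def trunk_set_args(trunk, subports):
    """Build the argument list for 'openstack network trunk set'.

    subports is a list of (port, vlan_id) tuples. One --subport option is
    emitted per tuple, using the option syntax shown in step 1.
    """
    args = ["openstack", "network", "trunk", "set"]
    for port, vlan_id in subports:
        args += [
            "--subport",
            f"port={port},segmentation-type=vlan,segmentation-id={vlan_id}",
        ]
    args.append(trunk)
    return args


if __name__ == "__main__":
    # subport4 is a hypothetical second port used only for illustration.
    print(" ".join(trunk_set_args("mytrunk", [("subport3", 3), ("subport4", 4)])))
```

The returned list can be passed to subprocess.run on a host where the OpenStack client is installed and credentials have been sourced.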

Removing subports from a trunk

To remove a subport from a trunk, use the openstack network trunk unset command:

ardana > openstack network trunk unset --subport subport3 mytrunk

Deleting a trunk port

To delete a trunk port, use the openstack network trunk delete command:

ardana > openstack network trunk delete mytrunk

Once a trunk has been created successfully, its parent port may be passed to the openstack server create command, which will make the VM VLAN-aware:

ardana > openstack server create --image ubuntu-server --flavor 1 --nic port-id=239f8807-be2e-4732-9de6-c64519f46358 vlan-aware-vm
Note

A trunk cannot be deleted until its parent port is unbound. This means you must delete the VM using the trunk port before you are allowed to delete the trunk.

10.4.16.5 VLAN-aware VM network configuration

This section illustrates how to configure the VLAN interfaces inside a VLAN-aware VM based upon the subports allocated to the trunk port being used.

  1. Run openstack network subport list to see the VLAN IDs in use on the trunk port:

    ardana > openstack network subport list --trunk mytrunk
    +--------------------------------------+-------------------+-----------------+
    | Port                                 | Segmentation Type | Segmentation ID |
    +--------------------------------------+-------------------+-----------------+
    | e3c38cb2-0567-4501-9602-c7a78300461e | vlan              |               2 |
    +--------------------------------------+-------------------+-----------------+
  2. Run openstack port show on the child port to get its mac_address:

    ardana > openstack port show -F mac_address e3c38cb2-0567-4501-9602-c7a78300461e
    +-------------+-------------------+
    | Field       | Value             |
    +-------------+-------------------+
    | mac_address | fa:16:3e:11:90:d2 |
    +-------------+-------------------+
  3. Log into the VLAN-aware VM and run the following commands to set up the VLAN interface:

    ardana > sudo ip link add link ens3 ens3.2 address fa:16:3e:11:90:d2 broadcast ff:ff:ff:ff:ff:ff type vlan id 2
    ardana > sudo ip link set dev ens3.2 up
  4. Note the usage of the mac_address from step 2 and VLAN ID from step 1 in configuring the VLAN interface:

    ardana > sudo ip link add link ens3 ens3.2 address fa:16:3e:11:90:d2 broadcast ff:ff:ff:ff:ff:ff type vlan id 2
  5. Trigger a DHCP request for the new VLAN interface to verify connectivity and retrieve its IP address. On an Ubuntu VM, this might be:

    ardana > sudo dhclient ens3.2
    ardana > sudo ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UP group default qlen 1000
        link/ether fa:16:3e:8d:77:39 brd ff:ff:ff:ff:ff:ff
        inet 10.10.10.5/24 brd 10.10.10.255 scope global ens3
           valid_lft forever preferred_lft forever
        inet6 fe80::f816:3eff:fe8d:7739/64 scope link
           valid_lft forever preferred_lft forever
    3: ens3.2@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
        link/ether fa:16:3e:11:90:d2 brd ff:ff:ff:ff:ff:ff
        inet 10.10.12.7/24 brd 10.10.12.255 scope global ens3.2
           valid_lft forever preferred_lft forever
        inet6 fe80::f816:3eff:fe11:90d2/64 scope link
           valid_lft forever preferred_lft forever
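
The interface setup in step 3 follows a fixed pattern: the interface name is the parent interface plus the VLAN ID, and the MAC address is that of the subport. The following Python sketch generates the two commands for any subport. The helper name vlan_interface_commands is hypothetical, and the parent interface name ens3 is taken from the example above.

```python
def vlan_interface_commands(parent_if, vlan_id, mac_address):
    """Return the shell commands that create and enable a VLAN interface.

    The interface is named <parent_if>.<vlan_id>, as in the example above.
    """
    name = f"{parent_if}.{vlan_id}"
    return [
        f"sudo ip link add link {parent_if} {name} address {mac_address} "
        f"broadcast ff:ff:ff:ff:ff:ff type vlan id {vlan_id}",
        f"sudo ip link set dev {name} up",
    ]


if __name__ == "__main__":
    for command in vlan_interface_commands("ens3", 2, "fa:16:3e:11:90:d2"):
        print(command)
```

With the values from the example (ens3, VLAN ID 2, and MAC address fa:16:3e:11:90:d2), the function reproduces the commands shown in step 3.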

10.4.16.6 Firewall issues

The SUSE OpenStack Cloud default firewall_driver is neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver. This default does not implement security groups for VLAN-aware VMs, but it does implement security groups for legacy VMs. For this reason, it is recommended to disable neutron security groups altogether when using VLAN-aware VMs. To do so, set:

firewall_driver = neutron.agent.firewall.NoopFirewallDriver

Doing this will prevent having a mix of firewalled and non-firewalled VMs in the same environment, but it should be done with caution because all VMs would be non-firewalled.

10.5 Creating a Highly Available Router

10.5.1 CVR and DVR High Availability Routers

CVR (Centralized Virtual Routing) and DVR (Distributed Virtual Routing) are two types of technologies which can be used to provide routing processes in SUSE OpenStack Cloud 9. You can create Highly Available (HA) versions of CVR and DVR routers by using the options in the table below when creating your router.

The command for creating a router, openstack router create router_name --distributed=True|False --ha=True|False, requires administrative permissions. See the example in the next section, Section 10.5.2, “Creating a High Availability Router”.

--distributed  --ha    Router Type  Description
False          False   CVR          Centralized Virtual Router
False          True    CVRHA        Centralized Virtual Router with L3 High Availability
True           False   DVR          Distributed Virtual Router without SNAT High Availability
True           True    DVRHA        Distributed Virtual Router with SNAT High Availability
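
The table above is a lookup from the two options to a router type. As an illustration only, it can be expressed as a small Python mapping. The names ROUTER_TYPES and router_type are hypothetical and are not part of any OpenStack API.

```python
# Keys are (distributed, ha) pairs; values are the router types from the table.
ROUTER_TYPES = {
    (False, False): "CVR",
    (False, True): "CVRHA",
    (True, False): "DVR",
    (True, True): "DVRHA",
}


def router_type(distributed, ha):
    """Return the router type for a combination of the two options."""
    return ROUTER_TYPES[(bool(distributed), bool(ha))]


if __name__ == "__main__":
    print(router_type(distributed=True, ha=True))
```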

10.5.2 Creating a High Availability Router

You can create a highly available router using the OpenStackClient.

  1. To create the HA router, add --ha=True to the openstack router create command. If you also want to make the router distributed, add --distributed=True. In this example, a DVR SNAT HA router is created with the name routerHA.

    ardana > openstack router create routerHA --distributed=True --ha=True
  2. Set the gateway for the external network and add an interface:

    ardana > openstack router set --external-gateway <ext-net-id> routerHA
    ardana > openstack router add subnet routerHA <private_subnet_id>
  3. When the router is created, the gateway is set, and the interface attached, you have a router with high availability.

10.5.3 Test Router for High Availability

You can demonstrate that the router is HA by running a continuous ping from a VM instance on the private network to an external server, such as a public DNS server. While the ping is running, list the L3 agents hosting the router and identify the agent that hosts the active router. Induce a failover by creating a catastrophic event, such as shutting down the node hosting that L3 agent. Once the node is shut down, the ping from the VM to the external network continues to run as the backup L3 agent takes over. To verify that the agent hosting the active router has changed, list the agents hosting the router again. A different agent is now hosting the active router.

  1. Boot an instance on the private network

    ardana > openstack server create --image <image_id> --flavor <flavor_id> --nic net-id=<private_net_id> --key-name <key> VM1
  2. Log into the VM using the SSH keys

    ardana > ssh -i <key> <ipaddress of VM1>
  3. Start a ping to X.X.X.X. While pinging, make sure there is no packet loss and leave the ping running.

    ardana > ping X.X.X.X
  4. Check which agent is hosting the active router.

    ardana > openstack network agent list --router <router_id>
  5. Shut down the node hosting the agent.

  6. Within 10 seconds, check again to see which L3 agent is hosting the active router.

    ardana > openstack network agent list --router <router_id>
  7. You will see a different agent.
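
The comparison in steps 4 to 7 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each agent is represented as a dictionary with hypothetical host and ha_state keys, which may not match the column names your client prints, and the host names are invented for the example.

```python
def active_host(agents):
    """Return the host whose agent reports the router as active, or None."""
    for agent in agents:
        if agent.get("ha_state") == "active":
            return agent.get("host")
    return None


def failover_occurred(before, after):
    """Return True if the active router moved to a different host."""
    old, new = active_host(before), active_host(after)
    return old is not None and new is not None and old != new


if __name__ == "__main__":
    # Hypothetical agent listings captured before and after the shutdown.
    before = [{"host": "node1", "ha_state": "active"},
              {"host": "node2", "ha_state": "standby"}]
    after = [{"host": "node2", "ha_state": "active"}]
    print(failover_occurred(before, after))
```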

11 Managing the Dashboard

Information about managing and configuring the Dashboard service.

11.1 Configuring the Dashboard Service

horizon is the OpenStack service that serves as the basis for the SUSE OpenStack Cloud dashboards.

The dashboards provide a web-based user interface to SUSE OpenStack Cloud services including Compute, Volume Operations, Networking, and Identity.

Along the left side of the dashboard are the Project and Identity sections. If your login credentials have been assigned the 'admin' role, you will also see a separate Admin section that provides additional system-wide setting options.

Across the top are menus to switch between projects and menus where you can access user settings.

11.1.1 Dashboard Service and TLS in SUSE OpenStack Cloud

By default, the Dashboard service is configured with TLS in the input model (ardana-input-model). You should not disable TLS in the input model for the Dashboard service. The normal use case for users is to have all services behind TLS, but users are given the freedom in the input model to take a service off TLS for troubleshooting or debugging. TLS should always be enabled for production environments.

Make sure that horizon_public_protocol and horizon_private_protocol are both set to use https.

11.2 Changing the Dashboard Timeout Value

The default session timeout for the dashboard is 1800 seconds or 30 minutes. This is the recommended default and best practice for those concerned with security.

As an administrator, you can change the session timeout by setting SESSION_TIMEOUT to any value less than or equal to 14400 seconds (four hours). Values greater than 14400 should not be used due to keystone constraints.

Warning

Increasing the value of SESSION_TIMEOUT increases the risk of abuse.
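
Before editing the configuration, a proposed value can be checked against the limits described above. The following Python sketch is illustrative only: the function check_session_timeout and the constant names are hypothetical, and the limits are the 1800-second default and 14400-second maximum stated in this section.

```python
DEFAULT_SESSION_TIMEOUT = 1800  # 30 minutes, the default stated above
MAX_SESSION_TIMEOUT = 14400     # four hours, the maximum stated above


def check_session_timeout(seconds):
    """Return seconds unchanged if it is a usable SESSION_TIMEOUT value.

    Raises ValueError for non-integer, non-positive, or too-large values.
    """
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("SESSION_TIMEOUT must be a positive integer (seconds)")
    if seconds > MAX_SESSION_TIMEOUT:
        raise ValueError(
            f"SESSION_TIMEOUT must not exceed {MAX_SESSION_TIMEOUT} seconds"
        )
    return seconds


if __name__ == "__main__":
    print(check_session_timeout(DEFAULT_SESSION_TIMEOUT))
```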

11.2.1 How to Change the Dashboard Timeout Value

Follow these steps to change and commit the horizon timeout value.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the Dashboard config file at ~/openstack/my_cloud/config/horizon/local_settings.py and, if it is not already present, add a line for SESSION_TIMEOUT above the line for SESSION_ENGINE.

    Here is an example snippet:

    SESSION_TIMEOUT = <timeout value>
    SESSION_ENGINE = 'django.contrib.sessions.backends.db'
    Important

    Do not exceed the maximum value of 14400.

  3. Commit the changes to git:

    git add -A
    git commit -a -m "changed horizon timeout value"
  4. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the Dashboard reconfigure playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml

11.3 Creating a Load Balancer with the Dashboard

In SUSE OpenStack Cloud 9 you can create a Load Balancer with the Load Balancer Panel in the Dashboard.

Follow the steps below to create the load balancer, listener, pool, add members to the pool and create the health monitor.

Note

Optionally, you may add members to the load balancer pool after the load balancer has been created.

  1. Log in to the Dashboard

    Log in to the Dashboard using your domain, user account and password.

  2. Navigate and Create Load Balancer

    Once logged into the Dashboard, navigate to the Load Balancers panel by selecting Project › Network › Load Balancers in the navigation menu, then select Create Load Balancer from the Load Balancers page.

  3. Create Load Balancer

    Provide the Load Balancer details, Load Balancer Name, Description (optional), IP Address and Subnet. When complete, select Next.

  4. Create Listener

    Provide a Name, Description, Protocol (HTTP, TCP, TERMINATED_HTTPS) and Port for the Load Balancer Listener.

  5. Create Pool

    Provide the Name, Description and Method (LEAST_CONNECTIONS, ROUND_ROBIN, SOURCE_IP) for the Load Balancer Pool.

  6. Add Pool Members

    Add members to the Load Balancer Pool.

    Note

    Optionally, you may add members to the load balancer pool after the load balancer has been created.

  7. Create Health Monitor

    Create Health Monitor by providing the Monitor type (HTTP, PING, TCP), the Health check interval, Retry count, timeout, HTTP Method, Expected HTTP status code and the URL path. Once all fields are filled, select Create Load Balancer.

  8. Load Balancer Provisioning Status

    Clicking on the Load Balancers tab again will provide the status of the Load Balancer. The Load Balancer will be in Pending Create until the Load Balancer is created, at which point the Load Balancer will change to an Active state.

  9. Load Balancer Overview

    Once Load Balancer 1 has been created, it will appear in the Load Balancers list. Click Load Balancer 1 to show the Overview. In this view, you can see the Load Balancer Provider type, the Admin State, Floating IP, Load Balancer, Subnet and Port IDs.


12 Managing Orchestration

Information about managing and configuring the Orchestration service, based on OpenStack heat.

12.1 Configuring the Orchestration Service

Information about configuring the Orchestration service, based on OpenStack heat.

The Orchestration service, based on OpenStack heat, does not need any additional configuration to be used. This document describes some configuration options as well as reasons you may want to use them.

heat Stack Tag Feature

heat provides a feature called Stack Tags to allow attributing a set of simple string-based tags to stacks and optionally the ability to hide stacks with certain tags by default. This feature can be used for behind-the-scenes orchestration of cloud infrastructure, without exposing the cloud user to the resulting automatically-created stacks.

Additional details can be seen here: OpenStack - Stack Tags.

In order to use the heat stack tag feature, you need to use the following steps to define the hidden_stack_tags setting in the heat configuration file and then reconfigure the service to enable the feature.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the heat configuration file, at this location:

    ~/openstack/my_cloud/config/heat/heat.conf.j2
  3. Under the [DEFAULT] section, add a line for hidden_stack_tags. Example:

    [DEFAULT]
    hidden_stack_tags="<hidden_tag>"
  4. Commit the changes to your local git:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add --all
    ardana > git commit -m "enabling heat Stack Tag feature"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Reconfigure the Orchestration service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml

To begin using the feature, use these steps to create a heat stack using the defined hidden tag. You will need to use credentials that have the heat admin permissions. In the example steps below we are going to do this from the Cloud Lifecycle Manager using the admin credentials and a heat template named heat.yaml:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the admin credentials:

    ardana > source ~/service.osrc
  3. Create a heat stack using this feature:

    ardana > openstack stack create -t heat.yaml hidden_stack_tags --tags hidden
  4. If you list your heat stacks, your hidden one will not show unless you use the --hidden switch.

    Example, not showing hidden stacks:

    ardana > openstack stack list

    Example, showing the hidden stacks:

    ardana > openstack stack list --hidden
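
The listing behavior can be modeled with a short Python sketch. This is a simplified illustration, not heat's implementation. It assumes that a stack is hidden when any of its tags appears in hidden_stack_tags and that the --hidden switch lists all stacks, including hidden ones. The function name visible_stacks and the sample stack names are hypothetical.

```python
def visible_stacks(stacks, hidden_stack_tags, show_hidden=False):
    """Return the stacks that a listing would show.

    stacks is a list of dicts with "name" and "tags" keys. A stack is
    treated as hidden if it carries at least one hidden tag.
    """
    if show_hidden:
        return list(stacks)
    hidden = set(hidden_stack_tags)
    return [s for s in stacks if not hidden.intersection(s.get("tags") or [])]


if __name__ == "__main__":
    stacks = [
        {"name": "web_stack", "tags": []},
        {"name": "hidden_stack_tags", "tags": ["hidden"]},
    ]
    print([s["name"] for s in visible_stacks(stacks, ["hidden"])])
    print([s["name"] for s in visible_stacks(stacks, ["hidden"], show_hidden=True)])
```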

12.2 Autoscaling using the Orchestration Service

Autoscaling is a process that can be used to scale up and down your compute resources based on the load they are currently experiencing to ensure a balanced load.

12.2.1 What is autoscaling?

Autoscaling is a process that can be used to scale up and down your compute resources based on the load they are currently experiencing to ensure a balanced load across your compute environment.

Important

Autoscaling is only supported for KVM.

12.2.2 How does autoscaling work?

The monitoring service, monasca, monitors your infrastructure resources and generates alarms based on their state. The Orchestration service, heat, talks to the monasca API and offers the capability to templatize the existing monasca resources, which are the monasca Notification and monasca Alarm definition. heat can configure certain alarms for the infrastructure resources (compute instances and block storage volumes) it creates and can expect monasca to notify continuously if a certain evaluation pattern in an alarm definition is met.

For example, heat can tell monasca that it needs an alarm generated if the average CPU utilization of the compute instance in a scaling group goes beyond 90%.

As monasca continuously monitors all the resources in the cloud, if it happens to see a compute instance spiking above 90% load as configured by heat, it generates an alarm and in turn sends a notification to heat. Once heat is notified, it will execute an action that was preconfigured in the template. Commonly, this action will be a scale up to increase the number of compute instances to balance the load that is being taken by the compute instance scaling group.

monasca sends a notification every 60 seconds while the alarm is in the ALARM state.

12.2.3 Autoscaling template example

The following monasca alarm definition template snippet is an example of instructing monasca to generate an alarm if the average CPU utilization in a group of compute instances exceeds 50%. If the alarm is triggered, it will invoke the up_notification webhook once the alarm evaluation expression is satisfied.

cpu_alarm_high:
  type: OS::monasca::AlarmDefinition
  properties:
    name: CPU utilization beyond 50 percent
    description: CPU utilization reached beyond 50 percent
    expression:
      str_replace:
        template: avg(cpu.utilization_perc{scale_group=scale_group_id}) > 50 times 3
        params:
          scale_group_id: {get_param: "OS::stack_id"}
    severity: high
    alarm_actions:
      - {get_resource: up_notification }

The following monasca notification template snippet is an example of creating a monasca notification resource that will be used by the alarm definition snippet to notify heat.

up_notification:
  type: OS::monasca::Notification
  properties:
    type: webhook
    address: {get_attr: [scale_up_policy, alarm_url]}
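
To make the alarm expression in the first snippet more concrete, the following Python sketch models how "> 50 times 3" might be evaluated. It is a simplified model, not monasca's threshold engine, and it assumes that "times 3" means the per-period average must exceed the threshold in three consecutive evaluation periods. The function name alarm_fires is hypothetical.

```python
def alarm_fires(period_averages, threshold=50.0, times=3):
    """Return True if the last `times` period averages all exceed threshold.

    period_averages holds one average CPU utilization value per
    evaluation period, oldest first.
    """
    if len(period_averages) < times:
        return False
    return all(value > threshold for value in period_averages[-times:])


if __name__ == "__main__":
    # Only the last two periods exceed 50, so the alarm does not fire yet.
    print(alarm_fires([40.0, 45.0, 62.0, 71.0]))
    # The last three periods all exceed 50, so the alarm fires.
    print(alarm_fires([45.0, 62.0, 71.0, 58.0]))
```

Under this model, a single spike above 50% does not trigger a scale-up. The load must stay high for three periods in a row.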

12.2.4 monasca Agent configuration options

There is a monasca Agent configuration option which controls the behavior around compute instance creation and the measurements being received from the compute instance.

The variable is monasca_libvirt_vm_probation which is set in the ~/openstack/my_cloud/config/nova/libvirt-monitoring.yml file. Here is a snippet of the file showing the description and variable:

# The period of time (in seconds) in which to suspend metrics from a
# newly-created VM. This is used to prevent creating and storing
# quickly-obsolete metrics in an environment with a high amount of instance
# churn (VMs created and destroyed in rapid succession).  Setting to 0
# disables VM probation and metrics will be recorded as soon as possible
# after a VM is created.  Decreasing this value in an environment with a high
# amount of instance churn can have a large effect on the total number of
# metrics collected and increase the amount of CPU, disk space and network
# bandwidth required for monasca. This value may need to be decreased if
# heat Autoscaling is in use so that heat knows that a new VM has been
# created and is handling some of the load.
monasca_libvirt_vm_probation: 300

The default value is 300. This is the time in seconds that a compute instance must live before the monasca libvirt agent plugin will send measurements for it. This is so that the monasca metrics database does not fill with measurements from short-lived compute instances. However, this means that the monasca threshold engine will not see measurements from a newly created compute instance for at least five minutes on scale up. If the newly created compute instance is able to start handling the load in less than five minutes, then heat autoscaling may mistakenly create another compute instance since the alarm does not clear.

If the default monasca_libvirt_vm_probation turns out to be an issue, it can be lowered. However, that will affect all compute instances, not just the ones used by heat autoscaling, which can increase the number of measurements stored in monasca if there are many short-lived compute instances. You should consider how often compute instances are created that live less than the new value of monasca_libvirt_vm_probation. If few, if any, compute instances live less than the value of monasca_libvirt_vm_probation, then this value can be decreased without causing issues. If many compute instances live less than the monasca_libvirt_vm_probation period, then decreasing monasca_libvirt_vm_probation can cause excessive disk, CPU and memory usage by monasca.
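
One way to judge the impact of a lower value is to look at how many recent compute instances lived for less than the proposed probation period. The following Python sketch shows the arithmetic. The function name short_lived_fraction and the sample lifetimes are hypothetical, and how you collect instance lifetimes depends on your environment.

```python
def short_lived_fraction(lifetimes_seconds, probation_seconds):
    """Return the fraction of instances that lived less than the probation.

    lifetimes_seconds is a list of instance lifetimes in seconds.
    """
    if not lifetimes_seconds:
        return 0.0
    short = sum(1 for t in lifetimes_seconds if t < probation_seconds)
    return short / len(lifetimes_seconds)


if __name__ == "__main__":
    # Hypothetical lifetimes of four recently deleted instances.
    lifetimes = [30, 600, 900, 45]
    print(short_lived_fraction(lifetimes, 60))
```

A fraction close to zero suggests that lowering the value to the proposed setting would add few short-lived metrics. A large fraction suggests the opposite.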

If you wish to change this value, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the monasca_libvirt_vm_probation value in this configuration file:

    ~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
  3. Commit your changes to the local git:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add --all
    ardana > git commit -m "changing monasca Agent configuration option"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run this playbook to reconfigure the nova service and enact your changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml

12.3 Orchestration Service support for LBaaS v2

In SUSE OpenStack Cloud, the Orchestration service provides support for LBaaS v2, which means users can create LBaaS v2 resources using Orchestration.

The OpenStack documentation for LBaaSv2 resource plugins is available at the following locations.

12.3.1 Limitations

In order to avoid stack-create timeouts when using load balancers, it is recommended that no more than 100 load balancers be created at a time using stack-create loops. Larger numbers of load balancers could reach quotas or exhaust resources, resulting in a stack-create timeout.
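
If a larger number of load balancers is needed, the work can be split into groups that respect this limit. The following Python sketch shows only the batching step. The function name batches and the load balancer names are hypothetical, and each group would still have to be created by your own tooling, one group at a time.

```python
def batches(items, size=100):
    """Yield consecutive groups of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hypothetical names for 250 load balancers.
    names = [f"lb-{i}" for i in range(250)]
    print([len(group) for group in batches(names)])
```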

12.3.2 More Information

For more information on the neutron command-line interface (CLI) and load balancing, see the OpenStack networking command-line client reference: http://docs.openstack.org/cli-reference/content/neutronclient_commands.html

For more information on heat see: http://docs.openstack.org/developer/heat

13 Managing Monitoring, Logging, and Usage Reporting

Information about the monitoring, logging, and metering services included with your SUSE OpenStack Cloud.

13.1 Monitoring

The SUSE OpenStack Cloud Monitoring service leverages OpenStack monasca, which is a multi-tenant, scalable, fault tolerant monitoring service.

13.1.1 Getting Started with Monitoring

You can use the SUSE OpenStack Cloud Monitoring service to monitor the health of your cloud and, if necessary, to troubleshoot issues.

monasca data can be extracted and used for a variety of legitimate purposes, and different purposes require different forms of data sanitization or encoding to protect against invalid or malicious data. Any data pulled from monasca should be considered untrusted data, so users are advised to apply appropriate encoding and/or sanitization techniques to ensure safe and correct usage and display of data in a web browser, database scan, or any other use of the data.

13.1.1.1 Monitoring Service Overview

13.1.1.1.1 Installation

The monitoring service is automatically installed as part of the SUSE OpenStack Cloud installation.

No specific configuration is required to use monasca. However, you can configure the database for storing metrics as explained in Section 13.1.2, “Configuring the Monitoring Service”.

13.1.1.1.2 Differences Between Upstream and SUSE OpenStack Cloud Implementations

In SUSE OpenStack Cloud, the OpenStack monitoring service, monasca, is included as the monitoring solution, except for the following which are not included:

  • Transform Engine

  • Events Engine

  • Anomaly and Prediction Engine

Note

Icinga was supported in previous SUSE OpenStack Cloud versions but it has been deprecated in SUSE OpenStack Cloud 9.

13.1.1.1.3 Diagram of monasca Service
Image
13.1.1.1.4 For More Information

For more details on OpenStack monasca, see monasca.io

13.1.1.1.5 Back-end Database

The monitoring service default metrics database is Cassandra, which is a highly-scalable analytics database and the recommended database for SUSE OpenStack Cloud.

You can learn more about Cassandra at Apache Cassandra.

13.1.1.2 Working with Monasca

monasca-Agent

The monasca-agent is a Python program that runs on the control plane nodes. It runs the defined checks and then sends data onto the API. The checks that the agent runs include:

  • System Metrics: CPU utilization, memory usage, disk I/O, network I/O, and filesystem utilization on the control plane and resource nodes.

  • Service Metrics: the agent supports plugins such as MySQL, RabbitMQ, Kafka, and many others.

  • VM Metrics: CPU utilization, disk I/O, network I/O, and memory usage of hosted virtual machines on compute nodes. Full details of these can be found at https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md#per-instance-metrics.

For a full list of packaged plugins that are included in SUSE OpenStack Cloud, see monasca Plugins.

You can further customize the monasca-agent to suit your needs; see Customizing the Agent.

13.1.1.3 Accessing the Monitoring Service

Access to the Monitoring service is available through a number of different interfaces.

13.1.1.3.1 Command-Line Interface

For users who prefer using the command line, there is the python-monascaclient, which is part of the default installation on your Cloud Lifecycle Manager node.

For details on the CLI, including installation instructions, see Python-monasca Client

monasca API

If low-level access is desired, there is the monasca REST API.

Full details of the monasca API can be found on GitHub.

13.1.1.3.2 Operations Console GUI

You can use the Operations Console (Ops Console) for SUSE OpenStack Cloud to view data about your SUSE OpenStack Cloud infrastructure in a web-based graphical user interface (GUI) and ensure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can triage alarm notifications in the following ways:

  • Alarm Definitions and notifications now have their own screens and are collected under the Alarm Explorer menu item which can be accessed from the Central Dashboard. Central Dashboard now allows you to customize the view in the following ways:

    • Rename or re-configure existing alarm cards to include services different from the defaults

    • Create a new alarm card with the services you want to select

    • Reorder alarm cards using drag and drop

    • View all alarms that have no service dimension now grouped in an Uncategorized Alarms card

    • View all alarms that have a service dimension that does not match any of the other cards, now grouped in an Other Alarms card

  • You can also easily access alarm data for a specific component. On the Summary page for the following components, a link is provided to an alarms screen specifically for that component.

13.1.1.3.3 Connecting to the Operations Console

To connect to Operations Console, perform the following:

  • Ensure your login has the required access credentials.

  • Connect through a browser.

  • Optionally use a host name or virtual IP address to access Operations Console.

Operations Console will always be accessed over port 9095.

13.1.1.4 Service Alarm Definitions

SUSE OpenStack Cloud comes with some predefined monitoring alarms for the services installed.

Full details of all service alarms can be found here: Section 18.1.1, “Alarm Resolution Procedures”.

Each alarm will have one of the following statuses:

  • Critical - Open alarms, identified by red indicator.

  • Warning - Open alarms, identified by yellow indicator.

  • Unknown - Open alarms, identified by gray indicator. Unknown will be the status of an alarm that has stopped receiving a metric. This can be caused by the following conditions:

    • An alarm exists for a service or component that is not installed in the environment.

    • An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.

    • There is a gap between the last reported metric and the next metric.

  • Open - Complete list of open alarms.

  • Total - Complete list of alarms, may include Acknowledged and Resolved alarms.

When alarms are triggered it is helpful to review the service logs.

13.1.2 Configuring the Monitoring Service

The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. You can also choose the database platform for your alarm metrics database if you do not want to use the default option provided with the product.

The steps in this section describe how to configure these options in SUSE OpenStack Cloud.

13.1.2.1 Configuring the Monitoring Email Notification Settings

The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. These steps will assist in this process.

If you are going to use the email notification feature of the monitoring service, you must set the configuration options with valid email settings including an SMTP server and valid email addresses. The email server is not provided by SUSE OpenStack Cloud, but must be specified in the configuration file described below. The email server must support SMTP.

13.1.2.1.1 Configuring monitoring notification settings during initial installation
  1. Log in to the Cloud Lifecycle Manager.

  2. To change the SMTP server configuration settings edit the following file:

    ~/openstack/my_cloud/definition/cloudConfig.yml
    1. Enter your email server settings. The following example snippet shows the configuration file contents. Uncomment these lines before entering your environment details.

          smtp-settings:
          #  server: mailserver.examplecloud.com
          #  port: 25
          #  timeout: 15
          # These are only needed if your server requires authentication
          #  user:
          #  password:

      This table explains each of these values:

      Value | Description
      Server (required) | The server entry must be uncommented and set to a valid host name or IP address.
      Port (optional) | If your SMTP server is running on a port other than the standard 25, uncomment the port line and set it to your port.
      Timeout (optional) | If your email server is heavily loaded, the timeout parameter can be uncommented and set to a larger value. The default is 15 seconds.
      User / Password (optional) | If your SMTP server requires authentication, you can configure user and password. Use double quotes around the password to avoid issues with special characters.

  3. To configure the sending email addresses, edit the following file:

    ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml

    Modify the following value to add your sending email address:

    email_from_addr
    Note

    The default value in the file is email_from_address: notification@exampleCloud.com which you should edit.

  4. [optional] To configure the receiving email addresses, edit the following file:

    ~/openstack/ardana/ansible/roles/monasca-default-alarms/defaults/main.yml

    Modify the following value to configure a receiving email address:

    notification_address
    Note

    You can also set the receiving email address via the Operations Console. Instructions for this are in Section 13.1.2.1.3, “Configuring monitoring notification settings after the initial installation”.

  5. If your environment requires a proxy, you can also configure it in the ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml file:

    # notification_environment can be used to configure proxies if needed.
    # Below is an example configuration. Note that all of the quotes are required.
    # notification_environment: '"http_proxy=http://<your_proxy>:<port>" "https_proxy=http://<your_proxy>:<port>"'
    notification_environment: ''
  6. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Updated monitoring service email notification settings"
  7. Continue with your installation.
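The rules for the smtp-settings values described in step 2 (server is required, port defaults to 25, timeout defaults to 15 seconds, user and password are optional) can be sketched as a small check. This is an illustrative helper only: the function name and the dictionary layout are assumptions, and it is not how the Cloud Lifecycle Manager itself validates cloudConfig.yml.

```python
DEFAULT_SMTP_PORT = 25      # standard SMTP port named in the settings table
DEFAULT_SMTP_TIMEOUT = 15   # default timeout in seconds named in the settings table


def normalize_smtp_settings(settings: dict) -> dict:
    """Validate an smtp-settings mapping and fill in the documented defaults."""
    server = settings.get("server")
    if not server:
        # The server entry is the only required value.
        raise ValueError("smtp-settings: 'server' is required")
    result = {
        "server": server,
        "port": int(settings.get("port", DEFAULT_SMTP_PORT)),
        "timeout": int(settings.get("timeout", DEFAULT_SMTP_TIMEOUT)),
    }
    # user and password are only carried over when they are present,
    # because they are needed only if the server requires authentication.
    for key in ("user", "password"):
        if settings.get(key) is not None:
            result[key] = settings[key]
    return result
```

For example, passing only `{"server": "mailserver.examplecloud.com"}` yields the same mapping with `port` set to 25 and `timeout` set to 15.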

13.1.2.1.2 Monasca and Apache Commons Validator

monasca notification uses the standard Apache Commons validator to validate the configured SUSE OpenStack Cloud domain names before sending a notification over a webhook. It supports some non-standard domain names, but not all. See the Domain Validator documentation for more information: https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/DomainValidator.html

You should ensure that any domains that you use are supported by IETF and IANA. As an example, .local is not listed by IANA and is invalid but .gov and .edu are valid.

Using an unsupported domain generates an "unprocessable entity" exception when you create a monasca notification:

HTTPException code=422 message={"unprocessable_entity":
{"code":422,"message":"Address https://myopenstack.sample:8000/v1/signal/test is not of correct format","details":"","internal_code":"c6cf9d9eb79c3fc4"}
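To pre-check a webhook address before creating a notification, you could test its top-level domain yourself. The sketch below only illustrates the idea: it is not the Apache Commons DomainValidator logic, the function name is hypothetical, and the set of top-level domains is a tiny illustrative subset that you would need to replace with the domains valid for your environment.

```python
from urllib.parse import urlsplit

# Illustrative subset only. The real validator checks against its own,
# much larger, lists of top-level domains.
EXAMPLE_KNOWN_TLDS = {"com", "org", "net", "gov", "edu"}


def webhook_tld_is_known(url: str, known_tlds=EXAMPLE_KNOWN_TLDS) -> bool:
    """Return True if the host part of url ends in one of known_tlds."""
    host = urlsplit(url).hostname
    if not host or "." not in host:
        return False
    # Take the label after the last dot as the top-level domain.
    tld = host.rsplit(".", 1)[1].lower()
    return tld in known_tlds
```

With this subset, the address from the error message above (ending in .sample) and any .local address are rejected, while an address ending in .gov or .edu is accepted.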
13.1.2.1.3 Configuring monitoring notification settings after the initial installation

If you need to make changes to the email notification settings after your initial deployment, you can change the "From" address using the configuration files but the "To" address will need to be changed in the Operations Console. The following section will describe both of these processes.

To change the sending email address:

  1. Log in to the Cloud Lifecycle Manager.

  2. To configure the sending email addresses, edit the following file:

    ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml

    Modify the following value to add your sending email address:

    email_from_addr
    Note

    The default value in the file is email_from_address: notification@exampleCloud.com which you should edit.

  3. Commit your configuration to the local Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Updated monitoring service email notification settings"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the monasca reconfigure playbook to deploy the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
    Note

    You may need to use the --ask-vault-pass switch if you opted for encryption during the initial deployment.
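The playbook sequence in steps 4 through 6 recurs throughout this chapter with different reconfigure playbooks. As a rough sketch, the sequence can be expressed as data that a wrapper script could run. The function name is hypothetical, the Git commit step is deliberately left out, and nothing here executes the commands.

```python
def reconfigure_commands(playbook, tags=None, ask_vault_pass=False):
    """Return (working_directory, argv) pairs for the documented sequence."""
    deploy_dir = "~/openstack/ardana/ansible"
    scratch_dir = "~/scratch/ansible/next/ardana/ansible"
    final = ["ansible-playbook", "-i", "hosts/verb_hosts", playbook]
    if tags:
        final += ["--tags", tags]
    if ask_vault_pass:
        # Only needed if encryption was chosen during the initial deployment.
        final.append("--ask-vault-pass")
    return [
        (deploy_dir, ["ansible-playbook", "-i", "hosts/localhost", "config-processor-run.yml"]),
        (deploy_dir, ["ansible-playbook", "-i", "hosts/localhost", "ready-deployment.yml"]),
        (scratch_dir, final),
    ]
```

For the procedure above you would call `reconfigure_commands("monasca-reconfigure.yml", tags="notification")` and run each entry in order, stopping at the first failure.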

To change the receiving ("To") email address via the Operations Console after installation:

  1. Connect to and log in to the Operations Console.

  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left side, click Home, and then Alarm Explorer.

  4. On the Alarm Explorer page, at the top, click the Notification Methods text.

  5. On the Notification Methods page, find the row with the Default Email notification.

  6. In the Default Email row, click the details icon (Ellipsis Icon), then click Edit.

  7. On the Edit Notification Method: Default Email page, in Name, Type, and Address/Key, type in the values you want to use.

  8. On the Edit Notification Method: Default Email page, click Update Notification.

Important

Once the notification has been added, running the Ansible playbook procedures will not change it.

13.1.2.2 Managing Notification Methods for Alarms

13.1.2.2.1 Enabling a Proxy for Webhook or Pager Duty Notifications

If your environment requires a proxy for communications to function, these steps show you how to enable one. The steps are only needed if you use the Webhook or Pager Duty notification methods.

These steps require access to the Cloud Lifecycle Manager in your cloud deployment, so you may need to contact your Administrator. You can make these changes during the initial configuration phase prior to the first installation, or you can modify your existing environment. The only difference is in the final steps.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml file and edit the line below with your proxy address values:

    notification_environment: '"http_proxy=http://<proxy_address>:<port>" "https_proxy=http://<proxy_address>:<port>"'
    Note

    There are single quotation marks around the entire value of this entry and then double quotation marks around the individual proxy entries. This formatting must exist when you enter these values into your configuration file.

  3. If you are making these changes prior to your initial installation then you are done and can continue on with the installation. However, if you are modifying an existing environment, you will need to continue on with the remaining steps below.

  4. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Generate an updated deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the monasca reconfigure playbook to enable these changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
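The quoting rule for notification_environment from step 2 (single quotes around the whole value, double quotes around each proxy entry) is easy to get wrong when typing by hand. The following sketch builds the line from a proxy host and port. The function name is hypothetical, and it assumes, as in the example in this guide, that the same http:// proxy URL is used for both entries.

```python
def notification_environment(proxy_host: str, port: int) -> str:
    """Build the notification_environment line with the required quoting."""
    proxy = f"http://{proxy_host}:{port}"
    # Each proxy entry is wrapped in double quotes ...
    entries = f'"http_proxy={proxy}" "https_proxy={proxy}"'
    # ... and the entire value is wrapped in single quotes.
    return f"notification_environment: '{entries}'"
```

You can paste the returned string into the defaults/main.yml file in place of the existing notification_environment line.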
13.1.2.2.2 Creating a New Notification Method
  1. Log in to the Operations Console.

  2. Use the navigation menu to go to the Alarm Explorer page:

    Image
  3. Select the Notification Methods menu and then click the Create Notification Method button:

    Image
  4. On the Create Notification Method window you will select your options and then click the Create Notification button.

    Image

    A description of each of the fields you use for each notification method:

    Field | Description
    Name | Enter a unique name value for the notification method you are creating.
    Type | Choose a type. Available values are Webhook, Email, or Pager Duty.
    Address/Key | Enter the value corresponding to the type you chose.
13.1.2.2.3 Applying a Notification Method to an Alarm Definition
  1. Log in to the Operations Console.

  2. Use the navigation menu to go to the Alarm Explorer page:

    Image
  3. Select the Alarm Definition menu which will give you a list of each of the alarm definitions in your environment.

    Image
  4. Locate the alarm you want to change the notification method for and click on its name to bring up the edit menu. You can use the sorting methods for assistance.

  5. In the edit menu, scroll down to the Notifications and Severity section where you will select one or more Notification Methods before selecting the Update Alarm Definition button:

    Image
  6. Repeat as needed until all of your alarms have the notification methods you desire.

13.1.2.3 Enabling the RabbitMQ Admin Console

The RabbitMQ Admin Console is off by default in SUSE OpenStack Cloud. You can turn on the console by following these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/rabbitmq/main.yml file. Under the rabbit_plugins: line, uncomment the following entry:

    - rabbitmq_management
  3. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Enabled RabbitMQ Admin Console"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the RabbitMQ reconfigure playbook to deploy the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-reconfigure.yml

To turn the RabbitMQ Admin Console off again, add the comment back and repeat steps 3 through 6.
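If you toggle the console often, the edit from step 2 can be scripted. The sketch below assumes that the entry appears on its own line as `- rabbitmq_management`, optionally preceded by a `#` comment marker. That layout and the function name are assumptions, so check your main.yml before relying on it.

```python
import re

# Matches the plugin entry with or without a leading comment marker.
_PLUGIN = re.compile(r"^(?P<indent>\s*)(?:#\s*)?- rabbitmq_management\s*$")


def set_management_plugin(text: str, enabled: bool) -> str:
    """Return text with the rabbitmq_management entry uncommented or commented."""
    out = []
    for line in text.splitlines():
        m = _PLUGIN.match(line)
        if m:
            # Keep the original indentation; add or drop the comment marker.
            prefix = "" if enabled else "# "
            line = f"{m.group('indent')}{prefix}- rabbitmq_management"
        out.append(line)
    return "\n".join(out) + "\n"
```

After writing the result back to the file, continue with steps 3 through 6 as described above.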

13.1.2.4 Capacity Reporting and Monasca Transform

Capacity reporting is a new feature in SUSE OpenStack Cloud that provides cloud operators with overall capacity information (available, used, and remaining) via the Operations Console, so that they can ensure that cloud resource pools have sufficient capacity to meet the demands of users. Cloud operators can also set thresholds and alarms to be notified when the thresholds are reached.

For Compute

  • Host Capacity - CPU/Disk/Memory: Used, Available and Remaining Capacity - for the entire cloud installation or by host

  • VM Capacity - CPU/Disk/Memory: Allocated, Available and Remaining - for the entire cloud installation, by host or by project

For Object Storage

  • Disk Capacity - Used, Available and Remaining Capacity - for the entire cloud installation or by project

In addition to overall capacity, roll up views with appropriate slices provide views by a particular project, or compute node. Graphs also show trends and the change in capacity over time.

13.1.2.4.1 monasca Transform Features
  • monasca Transform is a new component in monasca which transforms and aggregates metrics using Apache Spark

  • Aggregated metrics are published to Kafka and are available for other monasca components like monasca-threshold and are stored in monasca datastore

  • Cloud operators can set thresholds and set alarms to receive notifications when thresholds are met.

  • These aggregated metrics are made available to the cloud operators via Operations Console's new Capacity Summary (reporting) UI

  • Capacity reporting is a new feature in SUSE OpenStack Cloud which provides cloud operators with an overall capacity view (available, used and remaining) for Compute and Object Storage

  • Cloud operators can look at Capacity reporting via Operations Console's Compute Capacity Summary and Object Storage Capacity Summary UI

  • Capacity reporting lets cloud operators ensure that cloud resource pools have sufficient capacity to meet the demands of users. See the table below for Service and Capacity Types.

  • A list of aggregated metrics is provided in Section 13.1.2.4.4, “New Aggregated Metrics”.

  • Capacity reporting aggregated metrics are aggregated and published every hour

  • In addition to the overall capacity, graphs show the capacity trends over a time range (1 day, 7 days, 30 days, or 45 days)

  • Graphs showing the capacity trends by a particular project or compute host are also provided.

  • monasca Transform is integrated with centralized monitoring (monasca) and centralized logging

  • Flexible Deployment

  • Upgrade & Patch Support

Service | Type of Capacity | Description
Compute | Host Capacity | CPU/Disk/Memory: Used, Available and Remaining Capacity - for entire cloud installation or by compute host
Compute | VM Capacity | CPU/Disk/Memory: Allocated, Available and Remaining - for entire cloud installation, by host or by project
Object Storage | Disk Capacity | Used, Available and Remaining Disk Capacity - for entire cloud installation or by project
Object Storage | Storage Capacity | Utilized Storage Capacity - for entire cloud installation or by project

13.1.2.4.2 Architecture for Monasca Transform and Spark

monasca Transform is a new component in monasca. monasca Transform uses Spark for data aggregation. Both monasca Transform and Spark are depicted in the example diagram below.

Image

You can see that the monasca components run on the Cloud Controller nodes, and the monasca agents run on all nodes in the Mid-scale Example configuration.

Image
13.1.2.4.3 Components for Capacity Reporting
13.1.2.4.3.1 monasca Transform: Data Aggregation Reporting

monasca-transform is a new component which provides a mechanism to aggregate or transform metrics and publish new aggregated metrics to monasca.

monasca Transform is a data-driven, Apache Spark-based data aggregation engine which collects, groups and aggregates existing individual monasca metrics according to business requirements and publishes new transformed (derived) metrics to the monasca Kafka queue.

Since the new transformed metrics are published as any other metric in monasca, alarms can be set and triggered on the transformed metric, just like any other metric.

13.1.2.4.3.2 Object Storage and Compute Capacity Summary Operations Console UI

A new "Capacity Summary" tab for Compute and Object Storage displays all the aggregated metrics under the "Compute" and "Object Storage" sections.

The Operations Console UI makes calls to the monasca API to retrieve and display various tiles and graphs on the Capacity Summary tab in the Compute and Object Storage Summary UI pages.

13.1.2.4.3.3 Persist new metrics and Trigger Alarms

New aggregated metrics will be published to monasca's Kafka queue and will be ingested by monasca-persister. If thresholds and alarms have been set on the aggregated metrics, monasca will generate and trigger alarms as it currently does with any other metric. Persisting the new aggregated metrics and setting thresholds or alarms on them requires no additional changes.
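To illustrate the kind of threshold an operator might reason about, the sketch below derives a utilization percentage from two aggregated values, such as the hourly values of cpu.utilized_logical_cores_agg and cpu.total_logical_cores_agg. It is plain client-side arithmetic with hypothetical function names and made-up sample numbers. It does not show how monasca-threshold evaluates alarm definitions.

```python
def utilization_percent(utilized: float, total: float) -> float:
    """Return utilized capacity as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * utilized / total


def exceeds_threshold(utilized: float, total: float, threshold_percent: float) -> bool:
    """Return True if utilization has reached the given threshold."""
    return utilization_percent(utilized, total) >= threshold_percent
```

For example, 48 utilized cores out of 64 total cores is 75 percent, which stays below an 80 percent threshold.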

13.1.2.4.4 New Aggregated Metrics

The following aggregated metrics are produced by monasca Transform in SUSE OpenStack Cloud:

Table 13.1: Aggregated Metrics
# | Metric Name | For | Description | Dimensions | Notes
1 | cpu.utilized_logical_cores_agg | compute summary | utilized physical host cpu core capacity for one or all hosts by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
2 | cpu.total_logical_cores_agg | compute summary | total physical host cpu core capacity for one or all hosts by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
3 | mem.total_mb_agg | compute summary | total physical host memory capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
4 | mem.usable_mb_agg | compute summary | usable physical host memory capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
5 | disk.total_used_space_mb_agg | compute summary | utilized physical host disk capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
6 | disk.total_space_mb_agg | compute summary | total physical host disk capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
7 | nova.vm.cpu.total_allocated_agg | compute summary | cpus allocated across all VMs by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
8 | vcpus_agg | compute summary | virtual cpus allocated capacity for VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all or <project ID> | Available as total or per project
9 | nova.vm.mem.total_allocated_mb_agg | compute summary | memory allocated to all VMs by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
10 | vm.mem.used_mb_agg | compute summary | memory utilized by VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: <project ID> | Available as total or per project
11 | vm.mem.total_mb_agg | compute summary | memory allocated to VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: <project ID> | Available as total or per project
12 | vm.cpu.utilization_perc_agg | compute summary | cpu utilized by all VMs by project by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: <project ID> |
13 | nova.vm.disk.total_allocated_gb_agg | compute summary | disk space allocated to all VMs by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
14 | vm.disk.allocation_agg | compute summary | disk allocation for VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all or <project ID> | Available as total or per project
15 | swiftlm.diskusage.val.size_agg | object storage summary | total available object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
16 | swiftlm.diskusage.val.avail_agg | object storage summary | remaining object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
17 | swiftlm.diskusage.rate_agg | object storage summary | rate of change of object storage usage by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
18 | storage.objects.size_agg | object storage summary | used object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |

 
13.1.2.4.5 Deployment

monasca Transform and Spark will be deployed on the same control plane nodes along with Logging and Monitoring Service (monasca).

Security Consideration during deployment of monasca Transform and Spark

The SUSE OpenStack Cloud Monitoring system connects internally to the Kafka and Spark technologies without authentication. If you choose to deploy Monitoring, configure it to use only trusted networks such as the Management network, as illustrated on the network diagrams below for Entry Scale Deployment and Mid Scale Deployment.

Entry Scale Deployment

In an Entry Scale Deployment, monasca Transform and Spark will be deployed on the Shared Control Plane along with the other OpenStack services, including Monitoring and Logging.

Image

Mid scale Deployment

In a Mid Scale Deployment, monasca Transform and Spark will be deployed on the dedicated Metering, Monitoring and Logging (MML) control plane along with other data-processing-intensive services such as Metering, Monitoring and Logging.

Image

Multi Control Plane Deployment

In a Multi Control Plane Deployment, monasca Transform and Spark will be deployed on the Shared Control plane along with rest of monasca Components.

Start, Stop and Status for monasca Transform and Spark processes

The service management methods for monasca-transform and spark follow the convention for services in the OpenStack platform. When executing from the deployer node, the commands are as follows:

Status

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-status.yml

Start

Because monasca-transform depends on spark for processing the metrics, spark must be started before monasca-transform.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-start.yml

Stop

As a precaution, stop the monasca-transform service before taking spark down. Interrupting the spark service while monasca-transform is still running can leave a monasca-transform process unresponsive and in need of cleanup.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-stop.yml
ardana > ansible-playbook -i hosts/verb_hosts spark-stop.yml
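The ordering rule above (start spark first, stop monasca-transform first) can be captured in a few lines so that a wrapper script never runs the playbooks in the wrong order. This is a sketch with a hypothetical function name. It only returns the playbook file names used in this section and does not run them.

```python
# Dependency order: monasca-transform depends on spark.
SERVICES_IN_START_ORDER = ["spark", "monasca-transform"]


def playbooks(action: str) -> list:
    """Return playbook file names in a safe order for start, stop or status."""
    if action not in ("start", "stop", "status"):
        raise ValueError(f"unsupported action: {action}")
    # Stop in reverse dependency order so monasca-transform goes down first.
    if action == "stop":
        services = SERVICES_IN_START_ORDER[::-1]
    else:
        services = SERVICES_IN_START_ORDER
    return [f"{service}-{action}.yml" for service in services]
```

Each returned name would be passed to `ansible-playbook -i hosts/verb_hosts` from the ~/scratch/ansible/next/ardana/ansible directory, as in the commands above.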
13.1.2.4.6 Reconfigure

The reconfigure process can be triggered again from the deployer. Provided that changes have been made to the variables in the appropriate places, running the respective Ansible playbooks is enough to update the configuration. The spark reconfigure process alters the nodes serially, so spark is never down altogether: each node is stopped in turn and zookeeper manages the leaders accordingly. This means that monasca-transform may be left running even while spark is upgraded.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.7 Adding monasca Transform and Spark to SUSE OpenStack Cloud Deployment

Because monasca Transform and Spark are optional components, you might elect not to install them during the initial SUSE OpenStack Cloud installation. The following instructions describe how to add monasca Transform and Spark to an existing SUSE OpenStack Cloud deployment.

Steps

  1. Add monasca Transform and Spark to the input model. On an entry-scale cloud, monasca Transform and Spark are installed on the common control plane. On a mid-scale cloud, which has an MML (Metering, Monitoring and Logging) cluster, monasca Transform and Spark should be added to the MML cluster.

    ardana > cd ~/openstack/my_cloud/definition/data/

    Add spark and monasca-transform to input model, control_plane.yml

    clusters:
           - name: core
             cluster-prefix: c1
             server-role: CONTROLLER-ROLE
             member-count: 3
             allocation-policy: strict
             service-components:
    
               [...]
    
               - zookeeper
               - kafka
               - cassandra
               - storm
               - spark
               - monasca-api
               - monasca-persister
               - monasca-notifier
               - monasca-threshold
               - monasca-client
               - monasca-transform
    
               [...]
  2. Run the Configuration Processor

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Adding monasca Transform and Spark"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  3. Run Ready Deployment

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run Cloud Lifecycle Manager Deploy

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
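The input model change in step 1 amounts to making sure that both spark and monasca-transform appear in the service-components list of the chosen cluster. The sketch below shows that check on an already parsed cluster entry. The function name is hypothetical, and loading and saving control_plane.yml is left out.

```python
REQUIRED_COMPONENTS = ["spark", "monasca-transform"]


def add_transform_components(cluster: dict) -> dict:
    """Return a copy of a cluster entry that lists spark and monasca-transform."""
    components = list(cluster.get("service-components", []))
    for name in REQUIRED_COMPONENTS:
        # Append only what is missing so existing entries keep their order.
        if name not in components:
            components.append(name)
    return {**cluster, "service-components": components}
```

Running it on a cluster that already lists both components returns an equivalent entry, so the check is safe to repeat.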

Verify Deployment

Log in to each controller node and run the following commands. Example output for each command follows.

tux > sudo service monasca-transform status
tux > sudo service spark-master status
tux > sudo service spark-worker status

tux > sudo service monasca-transform status
● monasca-transform.service - monasca Transform Daemon
  Loaded: loaded (/etc/systemd/system/monasca-transform.service; disabled)
  Active: active (running) since Wed 2016-08-24 00:47:56 UTC; 2 days ago
Main PID: 7351 (bash)
  CGroup: /system.slice/monasca-transform.service
          ├─ 7351 bash /etc/monasca/transform/init/start-monasca-transform.sh
          ├─ 7352 /opt/stack/service/monasca-transform/venv//bin/python /opt/monasca/monasca-transform/lib/service_runner.py
          ├─27904 /bin/sh -c export SPARK_HOME=/opt/stack/service/spark/venv/bin/../current && spark-submit --supervise --master spark://omega-cp1-c1-m1-mgmt:7077,omega-cp1-c1-m2-mgmt:7077,omega-cp1-c1...
          ├─27905 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/lib/drizzle-jdbc-1.3.jar:/opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/v...
          └─28355 python /opt/monasca/monasca-transform/lib/driver.py
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.


tux > sudo service spark-worker status
● spark-worker.service - Spark Worker Daemon
  Loaded: loaded (/etc/systemd/system/spark-worker.service; disabled)
  Active: active (running) since Wed 2016-08-24 00:46:05 UTC; 2 days ago
Main PID: 63513 (bash)
  CGroup: /system.slice/spark-worker.service
          ├─ 7671 python -m pyspark.daemon
          ├─28948 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0...
          ├─63513 bash /etc/spark/init/start-spark-worker.sh &
          └─63514 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.



tux > sudo service spark-master status
● spark-master.service - Spark Master Daemon
  Loaded: loaded (/etc/systemd/system/spark-master.service; disabled)
  Active: active (running) since Wed 2016-08-24 00:44:24 UTC; 2 days ago
Main PID: 55572 (bash)
  CGroup: /system.slice/spark-master.service
          ├─55572 bash /etc/spark/init/start-spark-master.sh &
          └─55573 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
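When checking several nodes, you may prefer to test the status output programmatically instead of reading it. The sketch below looks for the `Active: active (running)` line shown in the example output above. The function name is hypothetical, and collecting the output from each node is left to you.

```python
import re

# Captures the state and sub-state from a line such as
# "Active: active (running) since ...".
_ACTIVE = re.compile(r"^\s*Active:\s+(\S+)\s+\((\w+)\)", re.MULTILINE)


def is_running(status_output: str) -> bool:
    """Return True if the status output reports 'active (running)'."""
    m = _ACTIVE.search(status_output)
    return bool(m) and m.group(1) == "active" and m.group(2) == "running"
```

Any output without an `Active:` line, or with a different state such as `inactive (dead)`, is treated as not running.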
13.1.2.4.8 Increase monasca Transform Scale

In the default configuration, monasca Transform can scale up to the estimated data volume of a 100-node cloud deployment. The estimated maximum rate of metrics from a 100-node cloud deployment is 120M/hour.

You can further increase the processing rate to 180M/hour. Making the Spark configuration change increases the number of CPUs used by Spark and monasca Transform from an average of around 3.5 to 5.5 CPUs per control node over a 10-minute batch processing interval.
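To put these rates in perspective, the hourly figures can be converted to metrics per second. The sketch below is simple arithmetic on the numbers quoted above, with a hypothetical helper name.

```python
def per_second(metrics_per_hour: float) -> float:
    """Convert an hourly metric rate to a per-second rate."""
    return metrics_per_hour / 3600.0


DEFAULT_RATE = 120_000_000     # 120M metrics per hour (default configuration)
INCREASED_RATE = 180_000_000   # 180M metrics per hour (after the Spark change)

default_per_second = per_second(DEFAULT_RATE)       # about 33,333 per second
increased_per_second = per_second(INCREASED_RATE)   # 50,000 per second
rate_increase = INCREASED_RATE / DEFAULT_RATE       # factor of 1.5
```

In other words, the configuration change raises throughput by half, while the average CPU use per control node grows from about 3.5 to 5.5.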

To increase the processing rate to 180M/hour, make the following Spark configuration change.

Steps

  1. Edit /var/lib/ardana/openstack/my_cloud/config/spark/spark-defaults.conf.j2 and set spark.cores.max to 6 and spark.executor.cores to 2.

    Set spark.cores.max to 6

    spark.cores.max {{ spark_cores_max }}

    to

    spark.cores.max 6

    Set spark.executor.cores to 2

    spark.executor.cores {{ spark_executor_cores }}

    to

    spark.executor.cores 2
  2. Edit ~/openstack/my_cloud/config/spark/spark-env.sh.j2

    Set SPARK_WORKER_CORES to 2

    export SPARK_WORKER_CORES={{ spark_worker_cores }}

    to

    export SPARK_WORKER_CORES=2
  3. Edit ~/openstack/my_cloud/config/spark/spark-worker-env.sh.j2

    Set SPARK_WORKER_CORES to 2

    export SPARK_WORKER_CORES={{ spark_worker_cores }}

    to

    export SPARK_WORKER_CORES=2
  4. Run Configuration Processor

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Changing Spark Config increase scale"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Run Ready Deployment

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run spark-reconfigure.yml and monasca-transform-reconfigure.yml

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
    ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.9 Change Compute Host Pattern Filter in Monasca Transform

monasca Transform identifies compute host metrics by pattern matching on the hostname dimension in the incoming monasca metrics. The default pattern is of the form compNNN, for example comp001, comp002, and so on. To filter for it in the transformation specs, use the expression -comp[0-9]+-. If your compute host names follow a different pattern, you must change the filter expression used when aggregating metrics.

Steps

  1. On the deployer: Edit ~/openstack/my_cloud/config/monasca-transform/transform_specs.json.j2

  2. Find all references to -comp[0-9]+- and change the regular expression to the desired pattern, such as -compute[0-9]+-. For example, change:

    {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data","insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"], "usage_fetch_operation": "avg", "filter_by_list": [{"field_to_filter": "host", "filter_expression": "-comp[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}

    to

    {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data", "insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"],"usage_fetch_operation": "avg","filter_by_list": [{"field_to_filter": "host","filter_expression": "-compute[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}
    Note

    The filter_expression has been changed to the new pattern.

  3. To change all host metric transformation specs in the same JSON file, repeat Step 2.

    The transformation specs must be changed for the following metric_ids: "mem_total_all", "mem_usable_all", "disk_total_all", "disk_usable_all", "cpu_total_all", "cpu_total_host", "cpu_util_all", and "cpu_util_host".

  4. Run the Configuration Processor:

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Changing monasca Transform specs"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Run Ready Deployment:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run monasca Transform Reconfigure:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
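
Steps 2 and 3 amount to rewriting the filter_expression of every host-based spec. The following sketch shows one way to do that for a single spec, assuming each spec occupies one line of valid JSON. The function name and the abbreviated sample spec are illustrative, and any such script should be tried on a copy of the file first.

```python
import json

# metric_ids named in step 3 of the procedure above.
HOST_METRIC_IDS = {
    "mem_total_all", "mem_usable_all", "disk_total_all", "disk_usable_all",
    "cpu_total_all", "cpu_total_host", "cpu_util_all", "cpu_util_host",
}

def update_spec_line(line, new_expression):
    """Return the spec line with its host filter expression replaced."""
    spec = json.loads(line)
    if spec.get("metric_id") not in HOST_METRIC_IDS:
        return line  # leave unrelated specs untouched
    params = spec.get("aggregation_params_map", {})
    for entry in params.get("filter_by_list", []):
        if entry.get("field_to_filter") == "host":
            entry["filter_expression"] = new_expression
    return json.dumps(spec)

# Abbreviated sample spec; the real specs carry more keys.
sample = (
    '{"aggregation_params_map": {"filter_by_list": [{"field_to_filter": "host", '
    '"filter_expression": "-comp[0-9]+", "filter_operation": "include"}]}, '
    '"metric_group": "mem_total_all", "metric_id": "mem_total_all"}'
)

updated = json.loads(update_spec_line(sample, "-compute[0-9]+"))
print(updated["aggregation_params_map"]["filter_by_list"][0]["filter_expression"])
```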

13.1.2.5 Configuring Availability of Alarm Metrics

Using the monasca agent tuning knobs, you can choose which alarm metrics are available in your environment.

The libvirt and OVS plugins for the monasca agent provide a number of additional metrics. Most of these metrics are enabled by default, but some are not. You can use tuning knobs to add or remove these metrics according to the needs of your cloud.

We will list these metrics along with the tuning knob name and instructions for how to adjust these.

13.1.2.5.1 Libvirt plugin metric tuning knobs

The following metrics are added as part of the libvirt plugin:

Note

For a description of each of these metrics, see Section 13.1.4.11, “Libvirt Metrics”.

Tuning Knob                      Default Setting  Admin Metric Name             Project Metric Name
vm_cpu_check_enable              True             vm.cpu.time_ns                cpu.time_ns
                                                  vm.cpu.utilization_norm_perc  cpu.utilization_norm_perc
                                                  vm.cpu.utilization_perc       cpu.utilization_perc
vm_disks_check_enable            True. Creates 20 disk metrics per disk device per virtual machine.
                                                  vm.io.errors                  io.errors
                                                  vm.io.errors_sec              io.errors_sec
                                                  vm.io.read_bytes              io.read_bytes
                                                  vm.io.read_bytes_sec          io.read_bytes_sec
                                                  vm.io.read_ops                io.read_ops
                                                  vm.io.read_ops_sec            io.read_ops_sec
                                                  vm.io.write_bytes             io.write_bytes
                                                  vm.io.write_bytes_sec         io.write_bytes_sec
                                                  vm.io.write_ops               io.write_ops
                                                  vm.io.write_ops_sec           io.write_ops_sec
vm_network_check_enable          True. Creates 16 network metrics per NIC per virtual machine.
                                                  vm.net.in_bytes               net.in_bytes
                                                  vm.net.in_bytes_sec           net.in_bytes_sec
                                                  vm.net.in_packets             net.in_packets
                                                  vm.net.in_packets_sec         net.in_packets_sec
                                                  vm.net.out_bytes              net.out_bytes
                                                  vm.net.out_bytes_sec          net.out_bytes_sec
                                                  vm.net.out_packets            net.out_packets
                                                  vm.net.out_packets_sec        net.out_packets_sec
vm_ping_check_enable             True             vm.ping_status                ping_status
vm_extended_disks_check_enable   True. Creates 6 metrics per device per virtual machine.
                                                  vm.disk.allocation            disk.allocation
                                                  vm.disk.capacity              disk.capacity
                                                  vm.disk.physical              disk.physical
                                 True. Creates 6 aggregate metrics per virtual machine.
                                                  vm.disk.allocation_total      disk.allocation_total
                                                  vm.disk.capacity_total        disk.capacity_total
                                                  vm.disk.physical_total        disk.physical_total
vm_disks_check_enable,           True. Creates 20 aggregate metrics per virtual machine.
vm_extended_disks_check_enable
                                                  vm.io.errors_total            io.errors_total
                                                  vm.io.errors_total_sec        io.errors_total_sec
                                                  vm.io.read_bytes_total        io.read_bytes_total
                                                  vm.io.read_bytes_total_sec    io.read_bytes_total_sec
                                                  vm.io.read_ops_total          io.read_ops_total
                                                  vm.io.read_ops_total_sec      io.read_ops_total_sec
                                                  vm.io.write_bytes_total       io.write_bytes_total
                                                  vm.io.write_bytes_total_sec   io.write_bytes_total_sec
                                                  vm.io.write_ops_total         io.write_ops_total
                                                  vm.io.write_ops_total_sec     io.write_ops_total_sec

13.1.2.5.1.1 Configuring the libvirt metrics using the tuning knobs

Use the following steps to configure the tuning knobs for the libvirt plugin metrics.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the following file:

    ~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
  3. Change the value for each tuning knob to the desired setting, True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

    vm_cpu_check_enable: <true or false>
    vm_disks_check_enable: <true or false>
    vm_extended_disks_check_enable: <true or false>
    vm_network_check_enable: <true or false>
    vm_ping_check_enable: <true or false>
  4. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "configuring libvirt plugin tuning knobs"
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the nova reconfigure playbook to implement the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
Note

If you modify either of the following files, then the monasca tuning parameters should be adjusted to handle a higher load on the system.

~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2

Tuning parameters are located in ~/openstack/my_cloud/config/monasca/configuration.yml. The parameter monasca_tuning_selector_override should be changed to the extra-large setting.
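
To see why enabling more knobs increases the load on monasca, the per-VM counts from the libvirt table can be added up. The sketch below is a rough estimate only. It assumes that the six aggregate disk metrics belong to vm_extended_disks_check_enable and that the 20 aggregate I/O metrics require both disk knobs, neither of which the table states outright. The function name is illustrative.

```python
def libvirt_metric_count(disks, nics, cpu=True, disk_io=True, network=True,
                         ping=True, extended_disks=True):
    """Estimate metrics created per VM, counting admin and project names."""
    total = 0
    if cpu:
        total += 6              # 3 admin + 3 project CPU metrics in the table
    if disk_io:
        total += 20 * disks     # 20 metrics per disk device
    if network:
        total += 16 * nics      # 16 metrics per NIC
    if ping:
        total += 2              # vm.ping_status and ping_status
    if extended_disks:
        total += 6 * disks      # 6 metrics per device
        total += 6              # assumption: aggregates follow the same knob
    if disk_io and extended_disks:
        total += 20             # assumption: aggregate I/O needs both knobs
    return total

# Example: a VM with two disk devices and one NIC, all knobs enabled.
print(libvirt_metric_count(disks=2, nics=1))
```

Multiplying the result by the number of virtual machines gives a first impression of how many additional metrics the agent reports.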

13.1.2.5.2 OVS plugin metric tuning knobs

The following metrics are added as part of the OVS plugin:

Note

For a description of each of these metrics, see Section 13.1.4.16, “Open vSwitch (OVS) Metrics”.

Tuning Knob                                   Default Setting  Admin Metric Name            Project Metric Name
use_rate_metrics                              False            ovs.vrouter.in_bytes_sec     vrouter.in_bytes_sec
                                                               ovs.vrouter.in_packets_sec   vrouter.in_packets_sec
                                                               ovs.vrouter.out_bytes_sec    vrouter.out_bytes_sec
                                                               ovs.vrouter.out_packets_sec  vrouter.out_packets_sec
use_absolute_metrics                          True             ovs.vrouter.in_bytes         vrouter.in_bytes
                                                               ovs.vrouter.in_packets       vrouter.in_packets
                                                               ovs.vrouter.out_bytes        vrouter.out_bytes
                                                               ovs.vrouter.out_packets      vrouter.out_packets
use_health_metrics with use_rate_metrics      False            ovs.vrouter.in_dropped_sec   vrouter.in_dropped_sec
                                                               ovs.vrouter.in_errors_sec    vrouter.in_errors_sec
                                                               ovs.vrouter.out_dropped_sec  vrouter.out_dropped_sec
                                                               ovs.vrouter.out_errors_sec   vrouter.out_errors_sec
use_health_metrics with use_absolute_metrics  False            ovs.vrouter.in_dropped       vrouter.in_dropped
                                                               ovs.vrouter.in_errors        vrouter.in_errors
                                                               ovs.vrouter.out_dropped      vrouter.out_dropped
                                                               ovs.vrouter.out_errors       vrouter.out_errors

13.1.2.5.2.1 Configuring the OVS metrics using the tuning knobs

Use the following steps to configure the tuning knobs for the OVS plugin metrics.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the following file:

    ~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2
  3. Change the value for each tuning knob to the desired setting, True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

    init_config:
       use_absolute_metrics: <true or false>
       use_rate_metrics: <true or false>
       use_health_metrics: <true or false>
  4. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "configuring OVS plugin tuning knobs"
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the neutron reconfigure playbook to implement the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
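
The way the three OVS knobs combine can be summarized in a short sketch. It derives the admin metric names from the table in Section 13.1.2.5.2. The function is illustrative only and is not part of the monasca agent.

```python
def enabled_ovs_metrics(use_rate_metrics=False, use_absolute_metrics=True,
                        use_health_metrics=False):
    """List the admin metric names implied by a combination of knobs."""
    directions = ("in", "out")
    traffic = ("bytes", "packets")
    health = ("dropped", "errors")
    names = []
    if use_absolute_metrics:
        names += [f"ovs.vrouter.{d}_{k}" for d in directions for k in traffic]
        if use_health_metrics:
            names += [f"ovs.vrouter.{d}_{k}" for d in directions for k in health]
    if use_rate_metrics:
        names += [f"ovs.vrouter.{d}_{k}_sec" for d in directions for k in traffic]
        if use_health_metrics:
            names += [f"ovs.vrouter.{d}_{k}_sec" for d in directions for k in health]
    return names

# Default settings from the table: only the absolute traffic metrics.
print(enabled_ovs_metrics())
```

The health metrics appear only in combination with the rate or absolute knob, which matches the "with" rows of the table.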

13.1.3 Integrating HipChat, Slack, and JIRA

monasca, the SUSE OpenStack Cloud monitoring and notification service, includes three default notification methods: email, PagerDuty, and webhook. monasca also supports three other notification plugins, which allow you to send notifications to HipChat, Slack, and JIRA. Unlike the default notification methods, the additional notification plugins must be configured manually.

This guide details the steps to configure each of the three non-default notification plugins. This guide also assumes that your cloud is fully deployed and functional.

13.1.3.1 Configuring the HipChat Plugin

To configure the HipChat plugin you will need the following four pieces of information from your HipChat system.

  • The URL of your HipChat system.

  • A token providing permission to send notifications to your HipChat system.

  • The ID of the HipChat room you wish to send notifications to.

  • A HipChat user account. This account will be used to authenticate any incoming notifications from your SUSE OpenStack Cloud cloud.

Obtain a token

Use the following instructions to obtain a token from your HipChat system.

  1. Log in to HipChat as the user account that will be used to authenticate the notifications.

  2. Navigate to the following URL: https://<your_hipchat_system>/account/api. Replace <your_hipchat_system> with the fully-qualified-domain-name of your HipChat system.

  3. Select the Create token option. Ensure that the token has the "SendNotification" attribute.

Obtain a room ID

Use the following instructions to obtain the ID of a HipChat room.

  1. Log in to HipChat as the user account that will be used to authenticate the notifications.

  2. Select My account from the application menu.

  3. Select the Rooms tab.

  4. Select the room that you want your notifications sent to.

  5. Look for the API ID field in the room information. This is the room ID.

Create HipChat notification type

Use the following instructions to create a HipChat notification type.

  1. Begin by obtaining the API URL for the HipChat room that you wish to send notifications to. The format for a URL used to send notifications to a room is as follows:

    /v2/room/{room_id_or_name}/notification

  2. Use the monasca API to create a new notification method. The following example demonstrates how to create a HipChat notification type named MyHipChatNotification, for room ID 13, using an example API URL and auth token.

    ardana > monasca notification-create  NAME TYPE ADDRESS
    ardana > monasca notification-create  MyHipChatNotification HIPCHAT https://hipchat.hpe.net/v2/room/13/notification?auth_token=1234567890

    The preceding example creates a notification type with the following characteristics:

    • NAME: MyHipChatNotification

    • TYPE: HIPCHAT

    • ADDRESS: https://hipchat.hpe.net/v2/room/13/notification

    • auth_token: 1234567890
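
The ADDRESS value is the room's notification URL with the token appended as a query parameter. The following sketch assembles it with Python's urllib.parse. The host name hipchat.example.com and the function name are placeholders, not values from your system.

```python
from urllib.parse import quote, urlencode

def hipchat_address(base_url, room, auth_token):
    """Build the ADDRESS argument for a HipChat notification type."""
    # Room names may contain characters that need escaping in a URL path.
    path = f"/v2/room/{quote(str(room), safe='')}/notification"
    query = urlencode({"auth_token": auth_token})
    return f"{base_url.rstrip('/')}{path}?{query}"

print(hipchat_address("https://hipchat.example.com", 13, "1234567890"))
```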

Note

The horizon dashboard can also be used to create a HipChat notification type.

13.1.3.2 Configuring the Slack Plugin

Configuring a Slack notification type requires four pieces of information from your Slack system.

  • Slack server URL

  • Authentication token

  • Slack channel

  • A Slack user account. This account will be used to authenticate incoming notifications to Slack.

Identify a Slack channel

  1. Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.

  2. In the left navigation panel, under the CHANNELS section locate the channel that you wish to receive the notifications. The instructions that follow will use the example channel #general.

Create a Slack token

  1. Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.

  2. Navigate to the following URL: https://api.slack.com/docs/oauth-test-tokens

  3. Select the Create token button.

Create a Slack notification type

  1. Begin by identifying the structure of the API call to be used by your notification method. The format for a call to the Slack Web API is as follows:

    https://slack.com/api/METHOD

    You can authenticate a Web API request by using the token that you created in the preceding Create a Slack token section. Doing so results in an API call that looks like the following.

    https://slack.com/api/METHOD?token=AUTH_TOKEN

    You can further refine your call by specifying the channel that the message will be posted to. Doing so will result in an API call that looks like the following.

    https://slack.com/api/METHOD?token=AUTH_TOKEN&channel=#channel

    The following example uses the chat.postMessage method, the token 1234567890, and the channel #general.

    https://slack.com/api/chat.postMessage?token=1234567890&channel=#general

    Find more information on the Slack Web API here: https://api.slack.com/web

  2. Use the CLI on your Cloud Lifecycle Manager to create a new Slack notification type, using the API call that you created in the preceding step. The following example creates a notification type named MySlackNotification, using token 1234567890, and posting to channel #general.

    ardana > monasca notification-create  MySlackNotification SLACK "https://slack.com/api/chat.postMessage?token=1234567890&channel=#general"
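
The Slack address can also be assembled programmatically. The sketch below uses urllib.parse.urlencode, which percent-encodes the # in the channel name as %23. Whether the Slack plugin expects the encoded or the literal form is not covered in this guide, so treat the output as an illustration of URL construction only. The function name is an assumption.

```python
from urllib.parse import urlencode

def slack_address(method, token, channel):
    """Build a Slack Web API URL carrying the token and channel."""
    query = urlencode({"token": token, "channel": channel})
    return f"https://slack.com/api/{method}?{query}"

print(slack_address("chat.postMessage", "1234567890", "#general"))
```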
Note

Notification types can also be created in the horizon dashboard.

13.1.3.3 Configuring the JIRA Plugin

Configuring the JIRA plugin requires three pieces of information from your JIRA system.

  • The URL of your JIRA system.

  • Username and password of a JIRA account that will be used to authenticate the notifications.

  • The name of the JIRA project that the notifications will be sent to.

Create JIRA notification type

You will configure the monasca service to send notifications to a particular JIRA project. You must also configure JIRA to create a new issue for each notification it receives for this project. That configuration is outside the scope of this document.

The monasca JIRA notification plugin supports only the following two JIRA issue fields.

  • PROJECT. This is the only supported mandatory JIRA issue field.

  • COMPONENT. This is the only supported optional JIRA issue field.

The JIRA issue type that your notifications will create may only be configured with the "Project" field as mandatory. If your JIRA issue type has any other mandatory fields, the monasca plugin will not function correctly. Currently, the monasca plugin only supports the single optional "component" field.

Creating the JIRA notification type requires a few more steps than other notification types covered in this guide. Because the Python and YAML files for this notification type are not yet included in SUSE OpenStack Cloud 9, you must perform the following steps to manually retrieve and place them on your Cloud Lifecycle Manager.

  1. Configure the JIRA plugin by adding the following block to the /etc/monasca/notification.yaml file, under the notification_types section, and adding the username and password of the JIRA account used for the notifications to the respective sections.

        plugins:
         - monasca_notification.plugins.jira_notifier:JiraNotifier

        jira:
            user:
            password:
            timeout: 60

    After adding the necessary block, the notification_types section should look like the following example. Note that you must also add the username and password for the JIRA user related to the notification type.

    notification_types:
        plugins:
         - monasca_notification.plugins.jira_notifier:JiraNotifier

        jira:
            user:
            password:
            timeout: 60

        webhook:
            timeout: 5

        pagerduty:
            timeout: 5
            url: "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
  2. Create the JIRA notification type. The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO.

    ardana > monasca notification-create  MyJiraNotification JIRA "https://jira.hpcloud.net/?project=HISO"

    The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO, and adds the optional component field with a value of keystone.

    ardana > monasca notification-create MyJiraNotification JIRA "https://jira.hpcloud.net/?project=HISO&component=keystone"
    Note

    There is a slash (/) separating the URL path and the query string. The slash is required if you have a query parameter without a path parameter.

    Note

    Notification types may also be created in the horizon dashboard.
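
The JIRA address consists of the base URL, the required slash, and a query string with the project and the optional component. A minimal sketch of that construction follows. The host name jira.example.com and the function name are placeholders.

```python
from urllib.parse import urlencode

def jira_address(base_url, project, component=None):
    """Build the ADDRESS argument for a JIRA notification type."""
    params = {"project": project}
    if component:
        params["component"] = component
    # The slash before the query string is kept explicitly.
    return f"{base_url.rstrip('/')}/?{urlencode(params)}"

print(jira_address("https://jira.example.com", "HISO"))
print(jira_address("https://jira.example.com", "HISO", component="keystone"))
```

When you pass the resulting address on a shell command line, enclose it in quotes so that the & character is not interpreted by the shell.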

13.1.4 Alarm Metrics

You can use the available metrics to create custom alarms to further monitor your cloud infrastructure and facilitate autoscaling features.

For details on how to create custom alarms using the Operations Console, see Section 16.2, “Alarm Definition”.
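
A custom alarm combines a metric name, optional dimensions, a statistical function, and a threshold into an alarm expression. The sketch below builds such an expression string in the general form used by upstream monasca, function(metric{dimensions}) operator threshold. The helper name and the threshold of 80 are illustrative assumptions, so verify the expression syntax against the monasca version you run before using it.

```python
def alarm_expression(function, metric, dimensions, operator, threshold):
    """Compose a monasca-style alarm expression string."""
    dims = ",".join(f"{key}={value}" for key, value in dimensions.items())
    selector = f"{metric}{{{dims}}}" if dims else metric
    return f"{function}({selector}) {operator} {threshold}"

# Example using a metric from the Apache table in this section.
expr = alarm_expression(
    "avg", "apache.performance.cpu_load_perc", {"service": "apache"}, ">", 80
)
print(expr)
```

The resulting string could then be supplied as the expression when creating an alarm definition, for example in the Operations Console or with the monasca CLI.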

13.1.4.1 Apache Metrics

A list of metrics associated with the Apache service.

Metric Name    Dimensions    Description
apache.net.hits
hostname
service=apache
component=apache
Total accesses
apache.net.kbytes_sec
hostname
service=apache
component=apache
Total Kbytes per second
apache.net.requests_sec
hostname
service=apache
component=apache
Total accesses per second
apache.net.total_kbytes
hostname
service=apache
component=apache
Total Kbytes
apache.performance.busy_worker_count
hostname
service=apache
component=apache
The number of workers serving requests
apache.performance.cpu_load_perc
hostname
service=apache
component=apache

The current percentage of CPU used by each worker and in total by all workers combined

apache.performance.idle_worker_count
hostname
service=apache
component=apache
The number of idle workers
apache.status
apache_port
hostname
service=apache
component=apache
Status of Apache port

13.1.4.2 ceilometer Metrics

A list of metrics associated with the ceilometer service.

Metric Name    Dimensions    Description
disk.total_space_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Total space of disk
disk.total_used_space_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Total used space of disk
swiftlm.diskusage.rate_agg
aggregation_period=hourly,
host=all,
project_id=all
 
swiftlm.diskusage.val.avail_agg
aggregation_period=hourly,
host,
project_id=all
 
swiftlm.diskusage.val.size_agg
aggregation_period=hourly,
host,
project_id=all
 
image
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Existence of the image
image.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Delete operation on this image
image.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=B,
source=openstack
Size of the uploaded image
image.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Update operation on this image
image.upload
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Upload operation on this image
instance
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=instance,
source=openstack
Existence of instance
disk.ephemeral.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of ephemeral disk on this instance
disk.root.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of root disk on this instance
memory
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=MB,
source=openstack
Size of memory on this instance
ip.floating
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=ip,
source=openstack
Existence of IP
ip.floating.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=ip,
source=openstack
Create operation on this fip
ip.floating.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=ip,
source=openstack
Update operation on this fip
mem.total_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Total space of memory
mem.usable_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Available space of memory
network
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=network,
source=openstack
Existence of network
network.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=network,
source=openstack
Create operation on this network
network.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=network,
source=openstack
Update operation on this network
network.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=network,
source=openstack
Delete operation on this network
port
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=port,
source=openstack
Existence of port
port.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=port,
source=openstack
Create operation on this port
port.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=port,
source=openstack
Delete operation on this port
port.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=port,
source=openstack
Update operation on this port
router
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=router,
source=openstack
Existence of router
router.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=router,
source=openstack
Create operation on this router
router.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=router,
source=openstack
Delete operation on this router
router.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=router,
source=openstack
Update operation on this router
snapshot
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=snapshot,
source=openstack
Existence of the snapshot
snapshot.create.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=snapshot,
source=openstack
Create operation on this snapshot
snapshot.delete.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=snapshot,
source=openstack
Delete operation on this snapshot
snapshot.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of this snapshot
subnet
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=subnet,
source=openstack
Existence of the subnet
subnet.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=subnet,
source=openstack
Create operation on this subnet
subnet.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=subnet,
source=openstack
Delete operation on this subnet
subnet.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=subnet,
source=openstack
Update operation on this subnet
vcpus
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=vcpus,
source=openstack
Number of virtual CPUs allocated to the instance
vcpus_agg
aggregation_period=hourly,
host=all,
project_id
Number of vcpus used by a project
volume
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=volume,
source=openstack
Existence of the volume
volume.create.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Create operation on this volume
volume.delete.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Delete operation on this volume
volume.resize.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Resize operation on this volume
volume.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of this volume
volume.update.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Update operation on this volume
storage.objects
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=object,
source=openstack
Number of objects
storage.objects.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=B,
source=openstack
Total size of stored objects
storage.objects.containers
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=container,
source=openstack
Number of containers

13.1.4.3 cinder Metrics

A list of metrics associated with the cinder service.

Metric Name    Dimensions    Description
cinderlm.cinder.backend.physical.list

service=block-storage, hostname, cluster, cloud_name, control_plane, component, backends

List of physical backends
cinderlm.cinder.backend.total.avail

service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname

Total available capacity metric per backend
cinderlm.cinder.backend.total.size

service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname

Total capacity metric per backend
cinderlm.cinder.cinder_services

service=block-storage, hostname, cluster, cloud_name, control_plane, component

Status of a cinder-volume service
cinderlm.hp_hardware.hpssacli.logical_drive

service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, logical_drive, controller_slot, array

The HPE Smart Storage Administrator (HPE SSA) CLI component will have to be installed for SSACLI status to be reported. To download and install the SSACLI utility to enable management of disk controllers, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

Status of a logical drive
cinderlm.hp_hardware.hpssacli.physical_drive

service=block-storage, hostname, cluster, cloud_name, control_plane, component, box, bay, controller_slot

Status of a physical drive
cinderlm.hp_hardware.hpssacli.smart_array

service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, model

Status of smart array
cinderlm.hp_hardware.hpssacli.smart_array.firmware

service=block-storage, hostname, cluster, cloud_name, control_plane, component, model

Checks firmware version

13.1.4.4 Compute Metrics

Note

Compute instance metrics are listed in Section 13.1.4.11, “Libvirt Metrics”.

A list of metrics associated with the Compute service.

Metric Name    Dimensions    Description
nova.heartbeat
service=compute
cloud_name
hostname
component
control_plane
cluster

Checks that all services are sending heartbeats. The check uses the nova user to list the services and then sets up a check for each one, for example nova-scheduler, nova-conductor, and nova-compute.

nova.vm.cpu.total_allocated
service=compute
hostname
component
control_plane
cluster
Total CPUs allocated across all VMs
nova.vm.disk.total_allocated_gb
service=compute
hostname
component
control_plane
cluster
Total Gbytes of disk space allocated to all VMs
nova.vm.mem.total_allocated_mb
service=compute
hostname
component
control_plane
cluster
Total Mbytes of memory allocated to all VMs

13.1.4.5 Crash Metrics

A list of metrics associated with the Crash service.

Metric Name    Dimensions    Description
crash.dump_count
service=system
hostname
cluster
Number of crash dumps found

13.1.4.6 Directory Metrics

A list of metrics associated with the Directory service.

Metric Name    Dimensions    Description
directory.files_count
service
hostname
path
Total number of files under a specific directory path
directory.size_bytes
service
hostname
path
Total size of a specific directory path

13.1.4.7 Elasticsearch Metrics

A list of metrics associated with the Elasticsearch service.

Metric Name    Dimensions    Description
elasticsearch.active_primary_shards
service=logging
url
hostname

Indicates the number of primary shards in your cluster. This is an aggregate total across all indices.

elasticsearch.active_shards
service=logging
url
hostname

Aggregate total of all shards across all indices, which includes replica shards.

elasticsearch.cluster_status
service=logging
url
hostname

Cluster health status.

elasticsearch.initializing_shards
service=logging
url
hostname

The count of shards that are being freshly created.

elasticsearch.number_of_data_nodes
service=logging
url
hostname

Number of data nodes.

elasticsearch.number_of_nodes
service=logging
url
hostname

Number of nodes.

elasticsearch.relocating_shards
service=logging
url
hostname

Shows the number of shards that are currently moving from one node to another node.

elasticsearch.unassigned_shards
service=logging
url
hostname

The number of unassigned shards from the master node.

13.1.4.8 HAProxy Metrics

A list of metrics associated with the HAProxy service.

Metric Name    Dimensions    Description
haproxy.backend.bytes.in_rate  
haproxy.backend.bytes.out_rate  
haproxy.backend.denied.req_rate  
haproxy.backend.denied.resp_rate  
haproxy.backend.errors.con_rate  
haproxy.backend.errors.resp_rate  
haproxy.backend.queue.current  
haproxy.backend.response.1xx  
haproxy.backend.response.2xx  
haproxy.backend.response.3xx  
haproxy.backend.response.4xx  
haproxy.backend.response.5xx  
haproxy.backend.response.other  
haproxy.backend.session.current  
haproxy.backend.session.limit  
haproxy.backend.session.pct  
haproxy.backend.session.rate  
haproxy.backend.warnings.redis_rate  
haproxy.backend.warnings.retr_rate  
haproxy.frontend.bytes.in_rate  
haproxy.frontend.bytes.out_rate  
haproxy.frontend.denied.req_rate  
haproxy.frontend.denied.resp_rate  
haproxy.frontend.errors.req_rate  
haproxy.frontend.requests.rate  
haproxy.frontend.response.1xx  
haproxy.frontend.response.2xx  
haproxy.frontend.response.3xx  
haproxy.frontend.response.4xx  
haproxy.frontend.response.5xx  
haproxy.frontend.response.other  
haproxy.frontend.session.current  
haproxy.frontend.session.limit  
haproxy.frontend.session.pct  
haproxy.frontend.session.rate  

13.1.4.9 HTTP Check Metrics

A list of metrics associated with the HTTP Check service:

Table 13.2: HTTP Check Metrics
Metric Name | Dimensions | Description
http_response_time
url
hostname
service
component
The response time in seconds of the http endpoint call.
http_status
url
hostname
service
The status of the http endpoint call (0 = success, 1 = failure).

For each component and HTTP metric name there are two separate metrics reported, one for the local URL and another for the virtual IP (VIP) URL:

Table 13.3: HTTP Metric Components
Component | Dimensions | Description
account-server
service=object-storage
component=account-server
url
swift account-server http endpoint status and response time
barbican-api
service=key-manager
component=barbican-api
url
barbican-api http endpoint status and response time
cinder-api
service=block-storage
component=cinder-api
url
cinder-api http endpoint status and response time
container-server
service=object-storage
component=container-server
url
swift container-server http endpoint status and response time
designate-api
service=dns
component=designate-api
url
designate-api http endpoint status and response time
glance-api
service=image-service
component=glance-api
url
glance-api http endpoint status and response time
glance-registry
service=image-service
component=glance-registry
url
glance-registry http endpoint status and response time
heat-api
service=orchestration
component=heat-api
url
heat-api http endpoint status and response time
heat-api-cfn
service=orchestration
component=heat-api-cfn
url
heat-api-cfn http endpoint status and response time
heat-api-cloudwatch
service=orchestration
component=heat-api-cloudwatch
url
heat-api-cloudwatch http endpoint status and response time
ardana-ux-services
service=ardana-ux-services
component=ardana-ux-services
url
ardana-ux-services http endpoint status and response time
horizon
service=web-ui
component=horizon
url
horizon http endpoint status and response time
keystone-api
service=identity-service
component=keystone-api
url
keystone-api http endpoint status and response time
monasca-api
service=monitoring
component=monasca-api
url
monasca-api http endpoint status
monasca-persister
service=monitoring
component=monasca-persister
url
monasca-persister http endpoint status
neutron-server
service=networking
component=neutron-server
url
neutron-server http endpoint status and response time
neutron-server-vip
service=networking
component=neutron-server-vip
url
neutron-server-vip http endpoint status and response time
nova-api
service=compute
component=nova-api
url
nova-api http endpoint status and response time
nova-vnc
service=compute
component=nova-vnc
url
nova-vnc http endpoint status and response time
object-server
service=object-storage
component=object-server
url
object-server http endpoint status and response time
object-storage-vip
service=object-storage
component=object-storage-vip
url
object-storage-vip http endpoint status and response time
octavia-api
service=octavia
component=octavia-api
url
octavia-api http endpoint status and response time
ops-console-web
service=ops-console
component=ops-console-web
url
ops-console-web http endpoint status and response time
proxy-server
service=object-storage
component=proxy-server
url
proxy-server http endpoint status and response time
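Since http_status reports 0 for success and 1 for failure, and each component reports one measurement per URL (local and VIP), failing endpoints can be grouped by component. The following is a minimal sketch, assuming the measurements have already been fetched into a list of dictionaries; the record layout, function name, and the URLs are hypothetical.

```python
from collections import defaultdict


def failing_endpoints(measurements):
    """Return {component: [url, ...]} for http_status measurements that report failure."""
    failures = defaultdict(list)
    for m in measurements:
        # http_status: 0 = success, 1 = failure
        if m["name"] == "http_status" and m["value"] == 1:
            dims = m["dimensions"]
            failures[dims.get("component", "unknown")].append(dims["url"])
    return dict(failures)


# Hypothetical sample records: one local URL and one VIP URL for nova-api
samples = [
    {"name": "http_status", "value": 0,
     "dimensions": {"component": "nova-api", "url": "http://192.0.2.10:8774/"}},
    {"name": "http_status", "value": 1,
     "dimensions": {"component": "nova-api", "url": "http://192.0.2.100:8774/"}},
    {"name": "http_response_time", "value": 0.12,
     "dimensions": {"component": "nova-api", "url": "http://192.0.2.10:8774/"}},
]
print(failing_endpoints(samples))
```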

13.1.4.10 Kafka Metrics

A list of metrics associated with the Kafka service.

Metric Name | Dimensions | Description
kafka.consumer_lag
topic
service
component=kafka
consumer_group
hostname
Consumer offset lag behind the broker offset, reported per hostname

13.1.4.11 Libvirt Metrics

Note

For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.1, “Libvirt plugin metric tuning knobs”.

A list of metrics associated with the Libvirt service.

Table 13.4: Tunable Libvirt Metrics
Admin Metric Name | Project Metric Name | Dimensions | Description
vm.cpu.time_ns | cpu.time_ns
zone
service
resource_id
hostname
component
Cumulative CPU time (in ns)
vm.cpu.utilization_norm_perc | cpu.utilization_norm_perc
zone
service
resource_id
hostname
component
Normalized CPU utilization (percentage)
vm.cpu.utilization_perc | cpu.utilization_perc
zone
service
resource_id
hostname
component
Overall CPU utilization (percentage)
vm.io.errors | io.errors
zone
service
resource_id
hostname
component
Overall disk I/O errors
vm.io.errors_sec | io.errors_sec
zone
service
resource_id
hostname
component
Disk I/O errors per second
vm.io.read_bytes | io.read_bytes
zone
service
resource_id
hostname
component
Disk I/O read bytes value
vm.io.read_bytes_sec | io.read_bytes_sec
zone
service
resource_id
hostname
component
Disk I/O read bytes per second
vm.io.read_ops | io.read_ops
zone
service
resource_id
hostname
component
Disk I/O read operations value
vm.io.read_ops_sec | io.read_ops_sec
zone
service
resource_id
hostname
component
Disk I/O read operations per second
vm.io.write_bytes | io.write_bytes
zone
service
resource_id
hostname
component
Disk I/O write bytes value
vm.io.write_bytes_sec | io.write_bytes_sec
zone
service
resource_id
hostname
component
Disk I/O write bytes per second
vm.io.write_ops | io.write_ops
zone
service
resource_id
hostname
component
Disk I/O write operations value
vm.io.write_ops_sec | io.write_ops_sec
zone
service
resource_id
hostname
component
Disk I/O write operations per second
vm.net.in_bytes | net.in_bytes
zone
service
resource_id
hostname
component
device
port_id
Network received total bytes
vm.net.in_bytes_sec | net.in_bytes_sec
zone
service
resource_id
hostname
component
device
port_id
Network received bytes per second
vm.net.in_packets | net.in_packets
zone
service
resource_id
hostname
component
device
port_id
Network received total packets
vm.net.in_packets_sec | net.in_packets_sec
zone
service
resource_id
hostname
component
device
port_id
Network received packets per second
vm.net.out_bytes | net.out_bytes
zone
service
resource_id
hostname
component
device
port_id
Network transmitted total bytes
vm.net.out_bytes_sec | net.out_bytes_sec
zone
service
resource_id
hostname
component
device
port_id
Network transmitted bytes per second
vm.net.out_packets | net.out_packets
zone
service
resource_id
hostname
component
device
port_id
Network transmitted total packets
vm.net.out_packets_sec | net.out_packets_sec
zone
service
resource_id
hostname
component
device
port_id
Network transmitted packets per second
vm.ping_status | ping_status
zone
service
resource_id
hostname
component
0 for ping success, 1 for ping failure
vm.disk.allocation | disk.allocation
zone
service
resource_id
hostname
component
Total Disk allocation for a device
vm.disk.allocation_total | disk.allocation_total
zone
service
resource_id
hostname
component
Total Disk allocation across devices for instances
vm.disk.capacity | disk.capacity
zone
service
resource_id
hostname
component
Total Disk capacity for a device
vm.disk.capacity_total | disk.capacity_total
zone
service
resource_id
hostname
component
Total Disk capacity across devices for instances
vm.disk.physical | disk.physical
zone
service
resource_id
hostname
component
Total Disk usage for a device
vm.disk.physical_total | disk.physical_total
zone
service
resource_id
hostname
component
Total Disk usage across devices for instances
vm.io.errors_total | io.errors_total
zone
service
resource_id
hostname
component
Total Disk I/O errors across all devices
vm.io.errors_total_sec | io.errors_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O errors per second across all devices
vm.io.read_bytes_total | io.read_bytes_total
zone
service
resource_id
hostname
component
Total Disk I/O read bytes across all devices
vm.io.read_bytes_total_sec | io.read_bytes_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O read bytes per second across devices
vm.io.read_ops_total | io.read_ops_total
zone
service
resource_id
hostname
component
Total Disk I/O read operations across all devices
vm.io.read_ops_total_sec | io.read_ops_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O read operations across all devices per sec
vm.io.write_bytes_total | io.write_bytes_total
zone
service
resource_id
hostname
component
Total Disk I/O write bytes across all devices
vm.io.write_bytes_total_sec | io.write_bytes_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O write bytes per second across devices
vm.io.write_ops_total | io.write_ops_total
zone
service
resource_id
hostname
component
Total Disk I/O write operations across all devices
vm.io.write_ops_total_sec | io.write_ops_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O write operations across all devices per sec
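In the table above, each admin metric name is the corresponding project metric name with a vm. prefix. The following is a minimal sketch of that naming relationship, useful when translating between the two views; the function and constant names are hypothetical and this is not how the monitoring agent itself is implemented.

```python
ADMIN_PREFIX = "vm."


def project_metric_name(admin_name):
    """Return the project-scoped name for a tunable libvirt admin metric name."""
    if not admin_name.startswith(ADMIN_PREFIX):
        raise ValueError("not a libvirt admin metric name: {!r}".format(admin_name))
    # Strip the leading "vm." to obtain the project metric name
    return admin_name[len(ADMIN_PREFIX):]


def admin_metric_name(project_name):
    """Return the admin-scoped name for a tunable libvirt project metric name."""
    return ADMIN_PREFIX + project_name


print(project_metric_name("vm.cpu.utilization_perc"))
print(admin_metric_name("net.in_bytes_sec"))
```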

These metrics in libvirt are always enabled and cannot be disabled using the tuning knobs.

Table 13.5: Untunable Libvirt Metrics
Admin Metric Name | Project Metric Name | Dimensions | Description
vm.host_alive_status | host_alive_status
zone
service
resource_id
hostname
component
-1 for no status, 0 for Running / OK, 1 for Idle / blocked, 2 for Paused, 3 for Shutting down, 4 for Shut off or nova suspend, 5 for Crashed, 6 for Power management suspend (S3 state)
vm.mem.free_mb | mem.free_mb
cluster
service
hostname
Free memory in Mbytes
vm.mem.free_perc | mem.free_perc
cluster
service
hostname
Percent of memory free
vm.mem.resident_mb | (none)
cluster
service
hostname
Total memory used on host, an Operations-only metric
vm.mem.swap_used_mb | mem.swap_used_mb
cluster
service
hostname
Used swap space in Mbytes
vm.mem.total_mb | mem.total_mb
cluster
service
hostname
Total memory in Mbytes
vm.mem.used_mb | mem.used_mb
cluster
service
hostname
Used memory in Mbytes
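The numeric values of vm.host_alive_status listed above can be translated into readable states with a simple lookup. The following is a minimal sketch; the dictionary and function names are hypothetical, and only the code-to-state mapping comes from the table.

```python
# Mapping taken from the vm.host_alive_status description above
HOST_ALIVE_STATUS = {
    -1: "No status",
    0: "Running / OK",
    1: "Idle / blocked",
    2: "Paused",
    3: "Shutting down",
    4: "Shut off or nova suspend",
    5: "Crashed",
    6: "Power management suspend (S3 state)",
}


def describe_host_alive_status(value):
    """Translate a host_alive_status measurement value into a readable state."""
    return HOST_ALIVE_STATUS.get(int(value), "Unknown status ({})".format(value))


print(describe_host_alive_status(0))
print(describe_host_alive_status(4.0))
```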

13.1.4.12 Monitoring Metrics

A list of metrics associated with the Monitoring service.

Metric Name | Dimensions | Description
alarm-state-transitions-added-to-batch-counter
service=monitoring
url
hostname
component=monasca-persister
 
jvm.memory.total.max
service=monitoring
url
hostname
component
Maximum JVM overall memory
jvm.memory.total.used
service=monitoring
url
hostname
component
Used JVM overall memory
metrics-added-to-batch-counter
service=monitoring
url
hostname
component=monasca-persister
 
metrics.published
service=monitoring
url
hostname
component=monasca-api
Total number of published metrics
monasca.alarms_finished_count
hostname
component=monasca-notification
service=monitoring
Total number of alarms received
monasca.checks_running_too_long
hostname
component=monasca-agent
service=monitoring
cluster
Only emitted when collection time for a check is too long
monasca.collection_time_sec
hostname
component=monasca-agent
service=monitoring
cluster
Collection time in monasca-agent
monasca.config_db_time
hostname
component=monasca-notification
service=monitoring
 
monasca.created_count
hostname
component=monasca-notification
service=monitoring
Number of notifications created
monasca.invalid_type_count
hostname
component=monasca-notification
service=monitoring
Number of notifications with invalid type
monasca.log.in_bulks_rejected
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.in_logs
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.in_logs_bytes
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.in_logs_rejected
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.out_logs
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.out_logs_lost
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.out_logs_truncated_bytes
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.processing_time_ms
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.publish_time_ms
hostname
component=monasca-log-api
service=monitoring
 
monasca.thread_count
service=monitoring
process_name
hostname
component
Number of threads monasca is using
raw-sql.time.avg
service=monitoring
url
hostname
component
Average raw sql query time
raw-sql.time.max
service=monitoring
url
hostname
component
Max raw sql query time

13.1.4.13 Monasca Aggregated Metrics

A list of the aggregated metrics associated with the monasca Transform feature.

Metric Name | For | Dimensions | Description
cpu.utilized_logical_cores_agg | Compute summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Utilized physical host CPU core capacity for one or all hosts by time interval (defaults to an hour)

Available as total or per host

cpu.total_logical_cores_agg | Compute summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Total physical host CPU core capacity for one or all hosts by time interval (defaults to an hour)

Available as total or per host

mem.total_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all

Total physical host memory capacity by time interval (defaults to an hour)

mem.usable_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all
Usable physical host memory capacity by time interval (defaults to an hour)
disk.total_used_space_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all

Utilized physical host disk capacity by time interval (defaults to an hour)

disk.total_space_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all
Total physical host disk capacity by time interval (defaults to an hour)
nova.vm.cpu.total_allocated_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all

CPUs allocated across all virtual machines by time interval (defaults to an hour)

vcpus_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Virtual CPU capacity allocated to virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

nova.vm.mem.total_allocated_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all

Memory allocated to all virtual machines by time interval (defaults to an hour)

vm.mem.used_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Memory utilized by virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

vm.mem.total_mb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Memory allocated to virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

vm.cpu.utilization_perc_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

CPU utilized by all virtual machines by project by time interval (defaults to an hour)

nova.vm.disk.total_allocated_gb_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all

Disk space allocated to all virtual machines by time interval (defaults to an hour)

vm.disk.allocation_agg | Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Disk allocation for virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

swiftlm.diskusage.val.size_agg | Object Storage summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Total available object storage capacity by time interval (defaults to an hour)

Available as total or per host

swiftlm.diskusage.val.avail_agg | Object Storage summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Remaining object storage capacity by time interval (defaults to an hour)

Available as total or per host

swiftlm.diskusage.rate_agg | Object Storage summary
aggregation_period: hourly
host: all
project_id: all

Rate of change of object storage usage by time interval (defaults to an hour)

storage.objects.size_agg | Object Storage summary
aggregation_period: hourly
host: all
project_id: all

Used object storage capacity by time interval (defaults to an hour)
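The aggregated metrics are distinguished by their aggregation_period, host, and project_id dimensions: host: all selects the cloud-wide total, while a specific hostname selects the per-host value where one is available. The following is a minimal sketch of selecting one of these series, assuming the measurements have already been fetched into a list of dictionaries; the record layout, function name, host name, and values are hypothetical.

```python
def select_aggregate(measurements, name, host="all", project_id="all"):
    """Return the measurements of an aggregated metric matching the given dimensions."""
    wanted = {
        "aggregation_period": "hourly",
        "host": host,
        "project_id": project_id,
    }
    return [
        m for m in measurements
        if m["name"] == name
        and all(m["dimensions"].get(key) == value for key, value in wanted.items())
    ]


# Hypothetical sample records: a cloud-wide total and a per-host value
samples = [
    {"name": "cpu.total_logical_cores_agg", "value": 96,
     "dimensions": {"aggregation_period": "hourly", "host": "all", "project_id": "all"}},
    {"name": "cpu.total_logical_cores_agg", "value": 48,
     "dimensions": {"aggregation_period": "hourly", "host": "compute-01", "project_id": "all"}},
]
total = select_aggregate(samples, "cpu.total_logical_cores_agg")
per_host = select_aggregate(samples, "cpu.total_logical_cores_agg", host="compute-01")
print(total[0]["value"], per_host[0]["value"])
```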

13.1.4.14 MySQL Metrics

A list of metrics associated with the MySQL service.

Metric Name | Dimensions | Description
mysql.innodb.buffer_pool_free
hostname
mode
service=mysql

The number of free pages, in bytes. This value is calculated by multiplying Innodb_buffer_pool_pages_free and Innodb_page_size of the server status variable.

mysql.innodb.buffer_pool_total
hostname
mode
service=mysql

The total size of buffer pool, in bytes. This value is calculated by multiplying Innodb_buffer_pool_pages_total and Innodb_page_size of the server status variable.

mysql.innodb.buffer_pool_used
hostname
mode
service=mysql

The number of used pages, in bytes. This value is calculated by subtracting Innodb_buffer_pool_pages_free from Innodb_buffer_pool_pages_total of the server status variable.

mysql.innodb.current_row_locks
hostname
mode
service=mysql

Corresponding to current row locks of the server status variable.

mysql.innodb.data_reads
hostname
mode
service=mysql

Corresponding to Innodb_data_reads of the server status variable.

mysql.innodb.data_writes
hostname
mode
service=mysql

Corresponding to Innodb_data_writes of the server status variable.

mysql.innodb.mutex_os_waits
hostname
mode
service=mysql

Corresponding to the OS waits of the server status variable.

mysql.innodb.mutex_spin_rounds
hostname
mode
service=mysql

Corresponding to spinlock rounds of the server status variable.

mysql.innodb.mutex_spin_waits
hostname
mode
service=mysql

Corresponding to the spin waits of the server status variable.

mysql.innodb.os_log_fsyncs
hostname
mode
service=mysql

Corresponding to Innodb_os_log_fsyncs of the server status variable.

mysql.innodb.row_lock_time
hostname
mode
service=mysql

Corresponding to Innodb_row_lock_time of the server status variable.

mysql.innodb.row_lock_waits
hostname
mode
service=mysql

Corresponding to Innodb_row_lock_waits of the server status variable.

mysql.net.connections
hostname
mode
service=mysql

Corresponding to Connections of the server status variable.

mysql.net.max_connections
hostname
mode
service=mysql

Corresponding to Max_used_connections of the server status variable.

mysql.performance.com_delete
hostname
mode
service=mysql

Corresponding to Com_delete of the server status variable.

mysql.performance.com_delete_multi
hostname
mode
service=mysql

Corresponding to Com_delete_multi of the server status variable.

mysql.performance.com_insert
hostname
mode
service=mysql

Corresponding to Com_insert of the server status variable.

mysql.performance.com_insert_select
hostname
mode
service=mysql

Corresponding to Com_insert_select of the server status variable.

mysql.performance.com_replace_select
hostname
mode
service=mysql

Corresponding to Com_replace_select of the server status variable.

mysql.performance.com_select
hostname
mode
service=mysql

Corresponding to Com_select of the server status variable.

mysql.performance.com_update
hostname
mode
service=mysql

Corresponding to Com_update of the server status variable.

mysql.performance.com_update_multi
hostname
mode
service=mysql

Corresponding to Com_update_multi of the server status variable.

mysql.performance.created_tmp_disk_tables
hostname
mode
service=mysql

Corresponding to Created_tmp_disk_tables of the server status variable.

mysql.performance.created_tmp_files
hostname
mode
service=mysql

Corresponding to Created_tmp_files of the server status variable.

mysql.performance.created_tmp_tables
hostname
mode
service=mysql

Corresponding to Created_tmp_tables of the server status variable.

mysql.performance.kernel_time
hostname
mode
service=mysql

The CPU kernel time for the database's performance, in seconds.

mysql.performance.open_files
hostname
mode
service=mysql

Corresponding to Open_files of the server status variable.

mysql.performance.qcache_hits
hostname
mode
service=mysql

Corresponding to Qcache_hits of the server status variable.

mysql.performance.queries
hostname
mode
service=mysql

Corresponding to Queries of the server status variable.

mysql.performance.questions
hostname
mode
service=mysql

Corresponding to Questions of the server status variable.

mysql.performance.slow_queries
hostname
mode
service=mysql

Corresponding to Slow_queries of the server status variable.

mysql.performance.table_locks_waited
hostname
mode
service=mysql

Corresponding to Table_locks_waited of the server status variable.

mysql.performance.threads_connected
hostname
mode
service=mysql

Corresponding to Threads_connected of the server status variable.

mysql.performance.user_time
hostname
mode
service=mysql

The CPU user time for the database's performance, in seconds.
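The three mysql.innodb.buffer_pool_* metrics are derived from page counts and the page size. The following is a minimal sketch of that arithmetic, assuming the status values have already been read into a dictionary and assuming the used size is the total size minus the free size; the function name and the sample values are hypothetical.

```python
def buffer_pool_bytes(status):
    """Compute the InnoDB buffer pool sizes in bytes from server status values."""
    page_size = int(status["Innodb_page_size"])
    # Page counts multiplied by the page size give sizes in bytes
    total = int(status["Innodb_buffer_pool_pages_total"]) * page_size
    free = int(status["Innodb_buffer_pool_pages_free"]) * page_size
    return {
        "mysql.innodb.buffer_pool_total": total,
        "mysql.innodb.buffer_pool_free": free,
        # Assumption: used is whatever part of the pool is not free
        "mysql.innodb.buffer_pool_used": total - free,
    }


# Hypothetical sample values: 16 KiB pages, 8192 pages in total, 1024 pages free
sample_status = {
    "Innodb_page_size": "16384",
    "Innodb_buffer_pool_pages_total": "8192",
    "Innodb_buffer_pool_pages_free": "1024",
}
pool = buffer_pool_bytes(sample_status)
print(pool)
```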

13.1.4.15 NTP Metrics

A list of metrics associated with the NTP service.

Metric Name | Dimensions | Description
ntp.connection_status
hostname
ntp_server
Value of ntp server connection status (0=Healthy)
ntp.offset
hostname
ntp_server
Time offset in seconds
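The two NTP metrics can be combined into a simple health check: a non-zero ntp.connection_status indicates an unhealthy connection, and ntp.offset gives the clock offset in seconds. The following is a minimal sketch, assuming the latest values per NTP server have been collected into a list of dictionaries; the record layout, function name, server names, and the 0.5 second threshold are hypothetical.

```python
def ntp_problems(samples, max_offset_sec=0.5):
    """Return (ntp_server, reason) pairs for servers that look unhealthy."""
    problems = []
    for s in samples:
        if s["connection_status"] != 0:  # 0 = healthy
            problems.append((s["ntp_server"], "connection"))
        elif abs(s["offset"]) > max_offset_sec:
            problems.append((s["ntp_server"], "offset"))
    return problems


# Hypothetical sample values for three NTP servers
ntp_samples = [
    {"ntp_server": "ntp1.example.com", "connection_status": 0, "offset": 0.002},
    {"ntp_server": "ntp2.example.com", "connection_status": 1, "offset": 0.0},
    {"ntp_server": "ntp3.example.com", "connection_status": 0, "offset": -1.7},
]
print(ntp_problems(ntp_samples))
```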

13.1.4.16 Open vSwitch (OVS) Metrics

A list of metrics associated with the OVS service.

Note

For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.2, “OVS plugin metric tuning knobs”.

Table 13.6: Per-router metrics
Admin Metric Name | Project Metric Name | Dimensions | Description
ovs.vrouter.in_bytes_sec | vrouter.in_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Inbound bytes per second for the router (if network_use_bits is false)

ovs.vrouter.in_packets_sec | vrouter.in_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets per second for the router

ovs.vrouter.out_bytes_sec | vrouter.out_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing bytes per second for the router (if network_use_bits is false)

ovs.vrouter.out_packets_sec | vrouter.out_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets per second for the router

ovs.vrouter.in_bytes | vrouter.in_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Inbound bytes for the router (if network_use_bits is false)

ovs.vrouter.in_packets | vrouter.in_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets for the router

ovs.vrouter.out_bytes | vrouter.out_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing bytes for the router (if network_use_bits is false)

ovs.vrouter.out_packets | vrouter.out_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets for the router

ovs.vrouter.in_dropped_sec | vrouter.in_dropped_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets per second for the router

ovs.vrouter.in_errors_sec | vrouter.in_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Number of incoming errors per second for the router

ovs.vrouter.out_dropped_sec | vrouter.out_dropped_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets per second for the router

ovs.vrouter.out_errors_sec | vrouter.out_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Number of outgoing errors per second for the router

ovs.vrouter.in_dropped | vrouter.in_dropped
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets for the router

ovs.vrouter.in_errors | vrouter.in_errors
service=networking
resource_id
component=ovs
router_name
port_id

Number of incoming errors for the router

ovs.vrouter.out_dropped | vrouter.out_dropped
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets for the router

ovs.vrouter.out_errors | vrouter.out_errors
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Number of outgoing errors for the router

Table 13.7: Per-DHCP port and rate metrics
Admin Metric Name | Tenant Metric Name | Dimensions | Description
ovs.vswitch.in_bytes_sec | vswitch.in_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Incoming bytes per second on the DHCP port (if network_use_bits is false)

ovs.vswitch.in_packets_sec | vswitch.in_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets per second for the DHCP port

ovs.vswitch.out_bytes_sec | vswitch.out_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing bytes per second on the DHCP port (if network_use_bits is false)

ovs.vswitch.out_packets_sec | vswitch.out_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets per second for the DHCP port

ovs.vswitch.in_bytes | vswitch.in_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Inbound bytes for the DHCP port (if network_use_bits is false)

ovs.vswitch.in_packets | vswitch.in_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets for the DHCP port

ovs.vswitch.out_bytes | vswitch.out_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing bytes for the DHCP port (if network_use_bits is false)

ovs.vswitch.out_packets | vswitch.out_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets for the DHCP port

ovs.vswitch.in_dropped_sec | vswitch.in_dropped_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets per second for the DHCP port

ovs.vswitch.in_errors_sec | vswitch.in_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Incoming errors per second for the DHCP port

ovs.vswitch.out_dropped_sec | vswitch.out_dropped_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets per second for the DHCP port

ovs.vswitch.out_errors_sec | vswitch.out_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing errors per second for the DHCP port

ovs.vswitch.in_dropped | vswitch.in_dropped
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets for the DHCP port

ovs.vswitch.in_errors | vswitch.in_errors
service=networking
resource_id
component=ovs
router_name
port_id

Errors received for the DHCP port

ovs.vswitch.out_dropped | vswitch.out_dropped
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets for the DHCP port

ovs.vswitch.out_errors | vswitch.out_errors
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Errors transmitted for the DHCP port
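The byte-based OVS metrics above are described as byte counts when network_use_bits is false. When comparing them against link speeds, which are usually quoted in bits per second, a unit conversion is needed. The following is a minimal sketch of that conversion; the function name and the sample value are hypothetical.

```python
BITS_PER_BYTE = 8


def bytes_per_sec_to_mbits_per_sec(bytes_per_sec):
    """Convert a rate in bytes per second to megabits per second (10**6 bits)."""
    return bytes_per_sec * BITS_PER_BYTE / 1_000_000


# Hypothetical measurement of ovs.vrouter.in_bytes_sec
in_bytes_sec = 1_250_000
print(bytes_per_sec_to_mbits_per_sec(in_bytes_sec))
```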

13.1.4.17 Process Metrics

A list of metrics associated with processes.

Metric Name | Dimensions | Description
process.cpu_perc
hostname
service
process_name
component
Percentage of cpu being consumed by a process
process.io.read_count
hostname
service
process_name
component
Number of reads by a process
process.io.read_kbytes
hostname
service
process_name
component
Kbytes read by a process
process.io.write_count
hostname
service
process_name
component
Number of writes by a process
process.io.write_kbytes
hostname
service
process_name
component
Kbytes written by a process
process.mem.rss_mbytes
hostname
service
process_name
component
Amount of physical memory allocated to a process, including memory from shared libraries in Mbytes
process.open_file_descriptors
hostname
service
process_name
component
Number of files being used by a process
process.pid_count
hostname
service
process_name
component
Number of processes that exist with this process name
process.thread_count
hostname
service
process_name
component
Number of threads a process is using

13.1.4.17.1 process.cpu_perc, process.mem.rss_mbytes, process.pid_count and process.thread_count metrics

Component Name | Dimensions | Description
apache-storm
service=monitoring
process_name=monasca-thresh
process_user=storm
apache-storm process info: cpu percent, memory, pid count and thread count
barbican-api
service=key-manager
process_name=barbican-api
barbican-api process info: cpu percent, memory, pid count and thread count
ceilometer-agent-notification
service=telemetry
process_name=ceilometer-agent-notification
ceilometer-agent-notification process info: cpu percent, memory, pid count and thread count
ceilometer-polling
service=telemetry
process_name=ceilometer-polling
ceilometer-polling process info: cpu percent, memory, pid count and thread count
cinder-api
service=block-storage
process_name=cinder-api
cinder-api process info: cpu percent, memory, pid count and thread count
cinder-scheduler
service=block-storage
process_name=cinder-scheduler
cinder-scheduler process info: cpu percent, memory, pid count and thread count
designate-api
service=dns
process_name=designate-api
designate-api process info: cpu percent, memory, pid count and thread count
designate-central
service=dns
process_name=designate-central
designate-central process info: cpu percent, memory, pid count and thread count
designate-mdns
service=dns
process_name=designate-mdns
designate-mdns process info: cpu percent, memory, pid count and thread count
designate-pool-manager
service=dns
process_name=designate-pool-manager
designate-pool-manager process info: cpu percent, memory, pid count and thread count
heat-api
service=orchestration
process_name=heat-api
heat-api process info: cpu percent, memory, pid count and thread count
heat-api-cfn
service=orchestration
process_name=heat-api-cfn
heat-api-cfn process info: cpu percent, memory, pid count and thread count
heat-api-cloudwatch
service=orchestration
process_name=heat-api-cloudwatch
heat-api-cloudwatch process info: cpu percent, memory, pid count and thread count
heat-engine
service=orchestration
process_name=heat-engine
heat-engine process info: cpu percent, memory, pid count and thread count
ipsec/charon
service=networking
process_name=ipsec/charon
ipsec/charon process info: cpu percent, memory, pid count and thread count
keystone-admin
service=identity-service
process_name=keystone-admin
keystone-admin process info: cpu percent, memory, pid count and thread count
keystone-main
service=identity-service
process_name=keystone-main
keystone-main process info: cpu percent, memory, pid count and thread count
monasca-agent
service=monitoring
process_name=monasca-agent
monasca-agent process info: cpu percent, memory, pid count and thread count
monasca-api
service=monitoring
process_name=monasca-api
monasca-api process info: cpu percent, memory, pid count and thread count
monasca-notification
service=monitoring
process_name=monasca-notification
monasca-notification process info: cpu percent, memory, pid count and thread count
monasca-persister
service=monitoring
process_name=monasca-persister
monasca-persister process info: cpu percent, memory, pid count and thread count
monasca-transform
service=monasca-transform
process_name=monasca-transform
monasca-transform process info: cpu percent, memory, pid count and thread count
neutron-dhcp-agent
service=networking
process_name=neutron-dhcp-agent
neutron-dhcp-agent process info: cpu percent, memory, pid count and thread count
neutron-l3-agent
service=networking
process_name=neutron-l3-agent
neutron-l3-agent process info: cpu percent, memory, pid count and thread count
neutron-metadata-agent
service=networking
process_name=neutron-metadata-agent
neutron-metadata-agent process info: cpu percent, memory, pid count and thread count
neutron-openvswitch-agent
service=networking
process_name=neutron-openvswitch-agent
neutron-openvswitch-agent process info: cpu percent, memory, pid count and thread count
neutron-rootwrap
service=networking
process_name=neutron-rootwrap
neutron-rootwrap process info: cpu percent, memory, pid count and thread count
neutron-server
service=networking
process_name=neutron-server
neutron-server process info: cpu percent, memory, pid count and thread count
neutron-vpn-agent
service=networking
process_name=neutron-vpn-agent
neutron-vpn-agent process info: cpu percent, memory, pid count and thread count
nova-api
service=compute
process_name=nova-api
nova-api process info: cpu percent, memory, pid count and thread count
nova-compute
service=compute
process_name=nova-compute
nova-compute process info: cpu percent, memory, pid count and thread count
nova-conductor
service=compute
process_name=nova-conductor
nova-conductor process info: cpu percent, memory, pid count and thread count
nova-novncproxy
service=compute
process_name=nova-novncproxy
nova-novncproxy process info: cpu percent, memory, pid count and thread count
nova-scheduler
service=compute
process_name=nova-scheduler
nova-scheduler process info: cpu percent, memory, pid count and thread count
octavia-api
service=octavia
process_name=octavia-api
octavia-api process info: cpu percent, memory, pid count and thread count
octavia-health-manager
service=octavia
process_name=octavia-health-manager
octavia-health-manager process info: cpu percent, memory, pid count and thread count
octavia-housekeeping
service=octavia
process_name=octavia-housekeeping
octavia-housekeeping process info: cpu percent, memory, pid count and thread count
octavia-worker
service=octavia
process_name=octavia-worker
octavia-worker process info: cpu percent, memory, pid count and thread count
org.apache.spark.deploy.master.Master
service=spark
process_name=org.apache.spark.deploy.master.Master
org.apache.spark.deploy.master.Master process info: cpu percent, memory, pid count and thread count
org.apache.spark.executor.CoarseGrainedExecutorBackend
service=monasca-transform
process_name=org.apache.spark.executor.CoarseGrainedExecutorBackend
org.apache.spark.executor.CoarseGrainedExecutorBackend process info: cpu percent, memory, pid count and thread count
pyspark
service=monasca-transform
process_name=pyspark
pyspark process info: cpu percent, memory, pid count and thread count
transform/lib/driver
service=monasca-transform
process_name=transform/lib/driver
transform/lib/driver process info: cpu percent, memory, pid count and thread count
cassandra
service=cassandra
process_name=cassandra
cassandra process info: cpu percent, memory, pid count and thread count
13.1.4.17.2 process.io.*, process.open_file_descriptors metrics
Component Name | Dimensions | Description
monasca-agent
service=monitoring
process_name=monasca-agent
process_user=mon-agent
monasca-agent process info: number of reads, number of writes, number of files being used

13.1.4.18 RabbitMQ Metrics

A list of metrics associated with the RabbitMQ service.

Metric Name | Dimensions | Description
rabbitmq.exchange.messages.published_count
hostname
exchange
vhost
type
service=rabbitmq

Value of the "publish_out" field of "message_stats" object

rabbitmq.exchange.messages.published_rate
hostname
exchange
vhost
type
service=rabbitmq

Value of the "rate" field of "message_stats/publish_out_details" object

rabbitmq.exchange.messages.received_count
hostname
exchange
vhost
type
service=rabbitmq

Value of the "publish_in" field of "message_stats" object

rabbitmq.exchange.messages.received_rate
hostname
exchange
vhost
type
service=rabbitmq

Value of the "rate" field of "message_stats/publish_in_details" object

rabbitmq.node.fd_used
hostname
node
service=rabbitmq

Value of the "fd_used" field in the response of /api/nodes

rabbitmq.node.mem_used
hostname
node
service=rabbitmq

Value of the "mem_used" field in the response of /api/nodes

rabbitmq.node.run_queue
hostname
node
service=rabbitmq

Value of the "run_queue" field in the response of /api/nodes

rabbitmq.node.sockets_used
hostname
node
service=rabbitmq

Value of the "sockets_used" field in the response of /api/nodes

rabbitmq.queue.messages
hostname
queue
vhost
service=rabbitmq

Sum of ready and unacknowledged messages (queue depth)

rabbitmq.queue.messages.deliver_rate
hostname
queue
vhost
service=rabbitmq

Value of the "rate" field of "message_stats/deliver_details" object

rabbitmq.queue.messages.publish_rate
hostname
queue
vhost
service=rabbitmq

Value of the "rate" field of "message_stats/publish_details" object

rabbitmq.queue.messages.redeliver_rate
hostname
queue
vhost
service=rabbitmq

Value of the "rate" field of "message_stats/redeliver_details" object
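
The queue-level descriptions above name fields inside a queue's statistics document. As a rough illustration of that mapping, the following Python sketch reads those fields from a hand-written sample. The sample values and the helper name queue_metrics are assumptions made for this example; this is not output from a real broker and not the monasca-agent implementation.

```python
# Illustrative only: `sample_queue` is a hand-written stand-in for the
# statistics document of one queue; every value in it is invented.
sample_queue = {
    "name": "notifications.info",
    "vhost": "/",
    "messages": 7,
    "message_stats": {
        "deliver_details": {"rate": 1.5},
        "publish_details": {"rate": 2.0},
        "redeliver_details": {"rate": 0.0},
    },
}


def queue_metrics(queue):
    """Map the fields named in the table to metric names (hypothetical helper)."""
    stats = queue.get("message_stats", {})

    def rate(details_key):
        # Assumption: a missing details object is reported as a rate of 0.0.
        return stats.get(details_key, {}).get("rate", 0.0)

    return {
        "rabbitmq.queue.messages": queue.get("messages", 0),
        "rabbitmq.queue.messages.deliver_rate": rate("deliver_details"),
        "rabbitmq.queue.messages.publish_rate": rate("publish_details"),
        "rabbitmq.queue.messages.redeliver_rate": rate("redeliver_details"),
    }


metrics = queue_metrics(sample_queue)
```

Treating absent fields as zero keeps the sketch from failing on a queue that has no statistics yet; a real collector might choose to skip such a queue instead.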

13.1.4.19 Swift Metrics

A list of metrics associated with the swift service.

Metric Name | Dimensions | Description
swiftlm.access.host.operation.get.bytes
service=object-storage

This metric is the number of bytes read from objects in GET requests processed by this host during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included.

swiftlm.access.host.operation.ops
service=object-storage

This metric is a count of all the API requests made to swift that were processed by this host during the last minute.

swiftlm.access.host.operation.project.get.bytes  
swiftlm.access.host.operation.project.ops  
swiftlm.access.host.operation.project.put.bytes  
swiftlm.access.host.operation.put.bytes
service=object-storage

This metric is the number of bytes written to objects in PUT or POST requests processed by this host during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included.

swiftlm.access.host.operation.status  
swiftlm.access.project.operation.status
service=object-storage

This metric reports whether the swiftlm-access-log-tailer program is running normally.

swiftlm.access.project.operation.ops
tenant_id
service=object-storage

This metric is a count of all the API requests made to swift for a given project ID that were processed by this host during the last minute.

swiftlm.access.project.operation.get.bytes
tenant_id
service=object-storage

This metric is the number of bytes read from objects in GET requests processed by this host for a given project during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included.

swiftlm.access.project.operation.put.bytes
tenant_id
service=object-storage

This metric is the number of bytes written to objects in PUT or POST requests processed by this host for a given project during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included.

swiftlm.async_pending.cp.total.queue_length
observer_host
service=object-storage

This metric reports the total length of all async pending queues in the system.

When a container update fails, the update is placed on the async pending queue. An update may fail because the container server is too busy or because the server is down or has failed. Later the system will “replay” updates from the queue, so eventually the container listings will show all objects known to the system.

If you know that container servers are down, it is normal to see the value of async pending increase. Once the server is restored, the value should return to zero.

A non-zero value may also indicate that containers are too large. Look for “lock timeout” messages in /var/log/swift/swift.log. If you find such messages, consider reducing the container size or enabling rate limiting.

swiftlm.check.failure
check
error
component
service=object-storage

The total exception string is truncated if longer than 1919 characters and an ellipsis is prepended in the first three characters of the message. If there is more than one error reported, the list of errors is paired to the last reported error and the operator is expected to resolve failures until no more are reported. Where there are no further reported errors, the Value Class is emitted as ‘Ok’.

swiftlm.diskusage.cp.avg.usage
observer_host
service=object-storage

The average utilization of all drives in the system. The value is a percentage (example: 30.0 means 30% of the total space is used).

swiftlm.diskusage.cp.max.usage
observer_host
service=object-storage

The highest utilization of all drives in the system. The value is a percentage (example: 80.0 means at least one drive is 80% utilized). The value is just as important as swiftlm.diskusage.cp.avg.usage. For example, if swiftlm.diskusage.cp.avg.usage is 70% you might think that there is plenty of space available. However, if swiftlm.diskusage.cp.max.usage is 100%, some objects cannot be stored on that drive. swift will store replicas on other drives, but this creates extra overhead.

swiftlm.diskusage.cp.min.usage
observer_host
service=object-storage

The lowest utilization of all drives in the system. The value is a percentage (example: 10.0 means at least one drive is 10% utilized).
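
To make the relationship between the average, maximum, and minimum utilization values concrete, here is a minimal Python sketch. The per-drive percentages are invented for illustration and were chosen so that the average matches the 70% example in the text; this is not how swiftlm itself gathers the data.

```python
# Hypothetical per-drive utilization percentages; not real measurements.
drive_usage_percent = [62.0, 58.5, 100.0, 59.5]

avg_usage = sum(drive_usage_percent) / len(drive_usage_percent)
max_usage = max(drive_usage_percent)
min_usage = min(drive_usage_percent)

# The average alone looks healthy, but the maximum reveals a full drive.
full_drives = [usage for usage in drive_usage_percent if usage >= 100.0]
```

With these numbers the average is 70.0 while the maximum is 100.0, which is the situation the description warns about: the system as a whole has space, yet one drive cannot accept more objects.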

swiftlm.diskusage.cp.total.avail
observer_host
service=object-storage

The size in bytes of available (unused) space on all drives in the system. Only drives used by swift are included.

swiftlm.diskusage.cp.total.size
observer_host
service=object-storage

The raw size in bytes of all drives in the system.

swiftlm.diskusage.cp.total.used
observer_host
service=object-storage

The size in bytes of used space on all drives in the system. Only drives used by swift are included.

swiftlm.diskusage.host.avg.usage
hostname
service=object-storage

This metric reports the average percent usage of all swift filesystems on a host.

swiftlm.diskusage.host.max.usage
hostname
service=object-storage

This metric reports the percent usage of a swift filesystem that is most used (full) on a host. The value is the max of the percentage used of all swift filesystems.

swiftlm.diskusage.host.min.usage
hostname
service=object-storage

This metric reports the percent usage of a swift filesystem that is least used (has free space) on a host. The value is the min of the percentage used of all swift filesystems.

swiftlm.diskusage.host.val.avail
hostname
service=object-storage
mount
device
label

This metric reports the number of bytes available (free) in a swift filesystem. The value is an integer (units: Bytes)

swiftlm.diskusage.host.val.size
hostname
service=object-storage
mount
device
label

This metric reports the size in bytes of a swift filesystem. The value is an integer (units: Bytes)

swiftlm.diskusage.host.val.usage
hostname
service=object-storage
mount
device
label

This metric reports the percent usage of a swift filesystem. The value is a floating point number in range 0.0 to 100.0

swiftlm.diskusage.host.val.used
hostname
service=object-storage
mount
device
label

This metric reports the number of used bytes in a swift filesystem. The value is an integer (units: Bytes)
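
The four per-filesystem values above (size, used, avail, usage) can be approximated for any mounted filesystem with the Python standard library. The following is a sketch under the assumption that the standard shutil.disk_usage figures are close enough for illustration; the helper name filesystem_usage is hypothetical and swiftlm may compute its values differently.

```python
import shutil


def filesystem_usage(mount_point):
    """Return size, used and avail in bytes plus usage in percent (0.0 to 100.0)."""
    total, used, free = shutil.disk_usage(mount_point)
    usage = 100.0 * used / total if total else 0.0
    return {"size": total, "used": used, "avail": free, "usage": usage}


# "/" is used only so that the sketch runs anywhere; on a swift node you
# would pass the mount point of a swift filesystem instead.
report = filesystem_usage("/")
```

Note that used plus avail can be smaller than size, because some filesystems reserve blocks that are not available to unprivileged users.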

swiftlm.load.cp.avg.five
observer_host
service=object-storage

This is the average of the five-minute system load averages of all nodes in the swift system.

swiftlm.load.cp.max.five
observer_host
service=object-storage

This is the five minute load average of the busiest host in the swift system.

swiftlm.load.cp.min.five
observer_host
service=object-storage

This is the five minute load average of the least loaded host in the swift system.

swiftlm.load.host.val.five
hostname
service=object-storage

This metric reports the 5 minute load average of a host. The value is derived from /proc/loadavg.
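
The five-minute value is the second whitespace-separated field of /proc/loadavg. A minimal Python sketch of that extraction follows; the sample line is invented and the function name five_minute_load is an assumption for this example.

```python
def five_minute_load(loadavg_text):
    """Return the 5 minute load average from the content of /proc/loadavg."""
    # Fields: 1 min, 5 min, 15 min, running/total tasks, last PID.
    fields = loadavg_text.split()
    return float(fields[1])


# Invented sample line in the /proc/loadavg layout.
sample = "0.20 0.18 0.12 1/80 11206"
five = five_minute_load(sample)
```

On a Linux host the same function can be applied to the text read from /proc/loadavg.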

swiftlm.md5sum.cp.check.ring_checksums
observer_host
service=object-storage

If you are in the middle of deploying new rings, it is normal for this to be in the failed state.

However, if you are not in the middle of a deployment, you need to investigate the cause. Use “swift-recon --md5 -v” to identify the problem hosts.

swiftlm.replication.cp.avg.account_duration
observer_host
service=object-storage

This is the average across all servers for the account replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds.

swiftlm.replication.cp.avg.container_duration
observer_host
service=object-storage

This is the average across all servers for the container replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds.

swiftlm.replication.cp.avg.object_duration
observer_host
service=object-storage

This is the average across all servers for the object replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds.

swiftlm.replication.cp.max.account_last
hostname
path
service=object-storage

This is the number of seconds since the account replicator last completed a scan on the host that has the oldest completion time. Normally the replicators run periodically, so this value decreases whenever a replicator completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and keeps increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle.

swiftlm.replication.cp.max.container_last
hostname
path
service=object-storage

This is the number of seconds since the container replicator last completed a scan on the host that has the oldest completion time. Normally the replicators run periodically, so this value decreases whenever a replicator completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and keeps increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle.

swiftlm.replication.cp.max.object_last
hostname
path
service=object-storage

This is the number of seconds since the object replicator last completed a scan on the host that has the oldest completion time. Normally the replicators run periodically, so this value decreases whenever a replicator completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and keeps increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle.
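
The three "last completed" metrics above all follow the same idea: take the host whose replicator finished longest ago and report how many seconds have passed since then. The sketch below shows that calculation in Python. The host names, timestamps, and the function name oldest_replication_age are invented for illustration.

```python
def oldest_replication_age(last_completed, now):
    """Return (host, seconds) for the host whose replicator finished longest ago.

    last_completed maps a host name to the Unix time of its last completed cycle.
    """
    host = min(last_completed, key=last_completed.get)
    return host, now - last_completed[host]


# Invented completion times for three hypothetical hosts.
completed = {"swobj1": 1000, "swobj2": 400, "swobj3": 950}
host, age = oldest_replication_age(completed, now=1060)
```

If the same host keeps being returned with a steadily growing age, that host is the one whose replicator is not completing its cycle.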

swiftlm.swift.drive_audit
hostname
service=object-storage
mount_point
kernel_device

If an unrecoverable read error (URE) occurs on a filesystem, the error is logged in the kernel log. The swift-drive-audit program scans the kernel log looking for patterns indicating possible UREs.

To get more information, log onto the node in question and run:

sudo swift-drive-audit /etc/swift/drive-audit.conf

UREs are common on large disk drives. They do not necessarily indicate that the drive has failed. You can use the xfs_repair command to attempt to repair the filesystem. Failing this, you may need to wipe the filesystem.

If UREs occur very often on a specific drive, this may indicate that the drive is about to fail and should be replaced.

swiftlm.swift.file_ownership.config
hostname
path
service

This metric reports if a directory or file has the appropriate owner. The check looks at swift configuration directories and files. It also looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects).

swiftlm.swift.file_ownership.data
hostname
path
service

This metric reports if a directory or file has the appropriate owner. The check looks at swift configuration directories and files. It also looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects).

swiftlm.swiftlm_check
hostname
service=object-storage

This indicates whether the swiftlm monasca-agent plug-in is running normally. If the status is failed, it is probable that some or all metrics are no longer being reported.

swiftlm.swift.replication.account.last_replication
hostname
service=object-storage

This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad.

swiftlm.swift.replication.container.last_replication
hostname
service=object-storage

This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad.

swiftlm.swift.replication.object.last_replication
hostname
service=object-storage

This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad.

swiftlm.swift.swift_services
hostname
service=object-storage

This metric reports whether the process named in the component dimension and in the msg value_meta is running.

Use the swift-start.yml playbook to attempt to restart the stopped process (it will start any process that has stopped – you do not need to specifically name the process).

swiftlm.swift.swift_services.check_ip_port
hostname
service=object-storage
component
Reports whether a service is listening on the correct IP address and port.
swiftlm.systems.check_mounts
hostname
service=object-storage
mount
device
label

This metric reports the mount state of each drive that should be mounted on this node.

swiftlm.systems.connectivity.connect_check
observer_host
url
target_port
service=object-storage

This metric reports whether a server can connect to the VIPs. Currently the following VIPs are checked:

  • The keystone VIP used to validate tokens (normally port 5000)

swiftlm.systems.connectivity.memcache_check
observer_host
hostname
target_port
service=object-storage

This metric reports whether memcached on the host specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg values are used:

We successfully connected to <hostname> on port <target_port>

{
  "dimensions": {
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "11211"
  },
  "metric": "swiftlm.systems.connectivity.memcache_check",
  "timestamp": 1449084058,
  "value": 0,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:11211 ok"
  }
}

We failed to connect to <hostname> on port <target_port>

{
  "dimensions": {
    "fail_message": "[Errno 111] Connection refused",
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "11211"
  },
  "metric": "swiftlm.systems.connectivity.memcache_check",
  "timestamp": 1449084150,
  "value": 2,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:11211 [Errno 111] Connection refused"
  }
}
swiftlm.systems.connectivity.rsync_check
observer_host
hostname
target_port
service=object-storage

This metric reports whether rsyncd on the host specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg values are used:

We successfully connected to <hostname> on port <target_port>:

{
  "dimensions": {
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "873"
  },
  "metric": "swiftlm.systems.connectivity.rsync_check",
  "timestamp": 1449082663,
  "value": 0,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:873 ok"
  }
}

We failed to connect to <hostname> on port <target_port>:

{
  "dimensions": {
    "fail_message": "[Errno 111] Connection refused",
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "873"
  },
  "metric": "swiftlm.systems.connectivity.rsync_check",
  "timestamp": 1449082860,
  "value": 2,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:873 [Errno 111] Connection refused"
  }
}
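
The memcached and rsyncd checks above differ only in the port they probe. As a hedged illustration, the following Python sketch attempts a TCP connection and builds a dictionary shaped like the JSON samples. The function name connect_check and the timeout parameter are assumptions for this example; the actual swiftlm check may be implemented differently.

```python
import socket
import time


def connect_check(metric, hostname, port, observer_host, timeout=2.0):
    """Try a TCP connection and return a metric shaped like the samples above."""
    dimensions = {
        "hostname": hostname,
        "observer_host": observer_host,
        "service": "object-storage",
        "target_port": str(port),
    }
    try:
        # Opening and immediately closing the connection is enough to show
        # that something is accepting connections on the port.
        with socket.create_connection((hostname, port), timeout=timeout):
            value, msg = 0, f"{hostname}:{port} ok"
    except OSError as err:
        # Covers refused connections, timeouts and name resolution errors.
        dimensions["fail_message"] = str(err)
        value, msg = 2, f"{hostname}:{port} {err}"
    return {
        "dimensions": dimensions,
        "metric": metric,
        "timestamp": int(time.time()),
        "value": value,
        "value_meta": {"msg": msg},
    }
```

Calling connect_check("swiftlm.systems.connectivity.rsync_check", host, 873, observer) would probe rsyncd, and the same call with port 11211 and the memcache_check metric name would probe memcached.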
swiftlm.umon.target.avg.latency_sec
component
hostname
observer_host
service=object-storage
url

Reports the average value of N-iterations of the latency values recorded for a component.

swiftlm.umon.target.check.state
component
hostname
observer_host
service=object-storage
url

This metric reports the state of each component after N-iterations of checks. If the initial check succeeds, the checks move onto the next component until all components are queried, then the checks sleep for ‘main_loop_interval’ seconds. If a check fails, it is retried every second for ‘retries’ number of times per component. If the check fails ‘retries’ times, it is reported as a fail instance.

A successful state will be reported in JSON:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.check.state",
    "timestamp": 1453111805,
    "value": 0
}

A failed state will report a “fail” value and the value_meta will provide the HTTP response error.

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.check.state",
    "timestamp": 1453112841,
    "value": 2,
    "value_meta": {
        "msg": "HTTPConnectionPool(host='192.168.245.9', port=8080): Max retries exceeded with url: /v1/AUTH_76538ce683654a35983b62e333001b47 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd857d7f550>: Failed to establish a new connection: [Errno 110] Connection timed out',))"
    }
}
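
The check-and-retry behavior described for swiftlm.umon.target.check.state can be sketched as a small loop. This is an illustration only: the function name run_checks, the injectable sleep parameter, and the reading that the initial attempt does not count toward 'retries' are all assumptions, not details confirmed by the text.

```python
import time


def run_checks(components, check, retries=3, main_loop_interval=60, sleep=time.sleep):
    """Run one pass over all components, then sleep for main_loop_interval seconds."""
    states = {}
    for component in components:
        ok = check(component)
        attempts = 0
        while not ok and attempts < retries:
            sleep(1)  # a failed check is retried every second
            ok = check(component)
            attempts += 1
        # 0 and 2 mirror the "value" field in the JSON samples above.
        states[component] = 0 if ok else 2
    sleep(main_loop_interval)
    return states
```

Passing the sleep function as a parameter makes the loop easy to exercise without real delays, for example by passing a function that only records the requested durations.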
swiftlm.umon.target.max.latency_sec
component
hostname
observer_host
service=object-storage
url

This metric reports the maximum response time in seconds of a REST call from the observer to the component REST API listening on the reported host

A response time query will be reported in JSON:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.max.latency_sec",
    "timestamp": 1453111805,
    "value": 0.2772650718688965
}

A failed query will have a much longer time value:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.max.latency_sec",
    "timestamp": 1453112841,
    "value": 127.288015127182
}
swiftlm.umon.target.min.latency_sec
component
hostname
observer_host
service=object-storage
url

This metric reports the minimum response time in seconds of a REST call from the observer to the component REST API listening on the reported host

A response time query will be reported in JSON:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.min.latency_sec",
    "timestamp": 1453111805,
    "value": 0.10025882720947266
}

A failed query will have a much longer time value:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.min.latency_sec",
    "timestamp": 1453112841,
    "value": 127.25378203392029
}
swiftlm.umon.target.val.avail_day
component
hostname
observer_host
service=object-storage
url

This metric reports the average of all the collected records in the swiftlm.umon.target.val.avail_minute metric data. This is a walking average data set of these approximately per-minute states of the swift Object Store. The most basic case is a whole day of successful per-minute records, which will average to 100% availability. If there is any downtime throughout the day resulting in gaps of data which are two minutes or longer, the per-minute availability data will be “back filled” with an assumption of a down state for all the per-minute records which did not exist during the non-reported time. Because this is a walking average of approximately 24 hours' worth of data, any outage will take 24 hours to be purged from the dataset.

A 24-hour average availability report:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.val.avail_day",
    "timestamp": 1453645405,
    "value": 7.894736842105263
}
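
The back-fill rule described for swiftlm.umon.target.val.avail_day can be illustrated with a short Python sketch. It assumes records are (timestamp, value) pairs with a value of 100.0 or 0.0, that a gap of 120 seconds or more is filled with one "down" record per missing minute, and that the caller passes only records from the last 24 hours. The function names are hypothetical and the real swift-uptime-monitor logic may differ.

```python
def backfill(records, gap_seconds=120, step_seconds=60):
    """Insert 0.0 ("down") records where two records are gap_seconds or more apart."""
    filled = []
    previous = None
    for timestamp, value in sorted(records):
        if previous is not None and timestamp - previous >= gap_seconds:
            missing = previous + step_seconds
            while missing < timestamp:
                filled.append((missing, 0.0))
                missing += step_seconds
        filled.append((timestamp, value))
        previous = timestamp
    return filled


def average_availability(records):
    """Average the per-minute values after back-filling gaps."""
    filled = backfill(records)
    # Assumption: with no records at all, report 0.0 rather than failing.
    return sum(value for _, value in filled) / len(filled) if filled else 0.0


# Three "up" records with a four-minute gap between the last two.
records = [(0, 100.0), (60, 100.0), (300, 100.0)]
availability = average_availability(records)
```

In this example the gap contributes three "down" minutes, so three of six records are up and the average is 50.0, even though every record that was actually collected reported 100.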
swiftlm.umon.target.val.avail_minute
component
hostname
observer_host
service=object-storage
url

A value of 100 indicates that swift-uptime-monitor was able to get a token from keystone and was able to perform operations against the swift API during the reported minute. A value of zero indicates that either keystone or swift failed to respond successfully. A metric is produced every minute that swift-uptime-monitor is running.

An “up” minute report value will report 100 [percent]:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.val.avail_minute",
    "timestamp": 1453645405,
    "value": 100.0
}

A “down” minute report value will report 0 [percent]:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.val.avail_minute",
    "timestamp": 1453649139,
    "value": 0.0
}
swiftlm.hp_hardware.hpssacli.smart_array.firmware
component
hostname
service=object-storage
model
controller_slot

This metric reports the firmware version of a component of a Smart Array controller.

swiftlm.hp_hardware.hpssacli.smart_array
component
hostname
service=object-storage
sub_component
model
controller_slot

This reports the status of various sub-components of a Smart Array Controller.

A failure is considered to have occurred if:

  • Controller is failed

  • Cache is not enabled or has failed

  • Battery or capacitor is not installed

  • Battery or capacitor has failed

swiftlm.hp_hardware.hpssacli.physical_drive
component
hostname
service=object-storage
controller_slot
box
bay

This reports the status of a disk drive attached to a Smart Array controller.

swiftlm.hp_hardware.hpssacli.logical_drive
component
hostname
observer_host
service=object-storage
controller_slot
array
logical_drive
sub_component

This reports the status of a LUN presented by a Smart Array controller.

A LUN is considered failed if the LUN has failed or if the LUN cache is not enabled and working.

Note
  • The HPE Smart Storage Administrator (HPE SSA) CLI component must be installed on all control nodes that are swift nodes in order to generate the following swift metrics:

    • swiftlm.hp_hardware.hpssacli.smart_array

    • swiftlm.hp_hardware.hpssacli.logical_drive

    • swiftlm.hp_hardware.hpssacli.smart_array.firmware

    • swiftlm.hp_hardware.hpssacli.physical_drive

  • HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

  • After the HPE SSA CLI component is installed on the swift nodes, the metrics will be generated automatically during the next agent polling cycle. A manual reboot of the node is not required.

13.1.4.20 System Metrics

A list of metrics associated with the System.

Table 13.8: CPU Metrics
Metric Name | Dimensions | Description
cpu.frequency_mhz
cluster
hostname
service=system

Maximum MHz value for the cpu frequency.

Note

This value is dynamic and is driven by the CPU governor depending on current resource needs.

cpu.idle_perc
cluster
hostname
service=system

Percentage of time the CPU is idle when no I/O requests are in progress

cpu.idle_time
cluster
hostname
service=system

Time the CPU is idle when no I/O requests are in progress

cpu.percent
cluster
hostname
service=system

Percentage of time the CPU is used in total

cpu.stolen_perc
cluster
hostname
service=system

Percentage of stolen CPU time, that is, the time spent in other OS contexts when running in a virtualized environment

cpu.system_perc
cluster
hostname
service=system

Percentage of time the CPU is used at the system level

cpu.system_time
cluster
hostname
service=system

Time the CPU is used at the system level

cpu.time_ns
cluster
hostname
service=system

Time the CPU is used at the host level

cpu.total_logical_cores
cluster
hostname
service=system

Total number of logical cores available for an entire node (includes hyper-threading).

Note

This is an optional metric that is only sent when send_rollup_stats is set to true.

cpu.user_perc
cluster
hostname
service=system

Percentage of time the CPU is used at the user level

cpu.user_time
cluster
hostname
service=system

Time the CPU is used at the user level

cpu.wait_perc
cluster
hostname
service=system

Percentage of time the CPU is idle AND there is at least one I/O request in progress

cpu.wait_time
cluster
hostname
service=system

Time the CPU is idle AND there is at least one I/O request in progress

Table 13.9: Disk Metrics
Metric Name | Dimensions | Description
disk.inode_used_perc
mount_point
service=system
hostname
cluster
device

The percentage of inodes that are used on a device

disk.space_used_perc
mount_point
service=system
hostname
cluster
device

The percentage of disk space that is being used on a device

disk.total_space_mb
mount_point
service=system
hostname
cluster
device

The total amount of disk space in Mbytes aggregated across all the disks on a particular node.

Note

This is an optional metric that is only sent when send_rollup_stats is set to true.

disk.total_used_space_mb
mount_point
service=system
hostname
cluster
device

The total amount of used disk space in Mbytes aggregated across all the disks on a particular node.

Note

This is an optional metric that is only sent when send_rollup_stats is set to true.

io.read_kbytes_sec
mount_point
service=system
hostname
cluster
device

Kbytes/sec read by an io device

io.read_req_sec
mount_point
service=system
hostname
cluster
device

Number of read requests/sec to an io device

io.read_time_sec
mount_point
service=system
hostname
cluster
device

Amount of read time in seconds to an io device

io.write_kbytes_sec
mount_point
service=system
hostname
cluster
device

Kbytes/sec written by an io device

io.write_req_sec
mount_point
service=system
hostname
cluster
device

Number of write requests/sec to an io device

io.write_time_sec
mount_point
service=system
hostname
cluster
device

Amount of write time in seconds to an io device

Table 13.10: Load Metrics
Metric Name | Dimensions | Description
load.avg_15_min
service=system
hostname
cluster

The normalized (by number of logical cores) average system load over a 15 minute period

load.avg_1_min
service=system
hostname
cluster

The normalized (by number of logical cores) average system load over a 1 minute period

load.avg_5_min
service=system
hostname
cluster

The normalized (by number of logical cores) average system load over a 5 minute period

Table 13.11: Memory Metrics
Metric Name | Dimensions | Description
mem.free_mb
service=system
hostname
cluster

Mbytes of free memory

mem.swap_free_mb
service=system
hostname
cluster

Mbytes of swap memory that is free

mem.swap_free_perc
service=system
hostname
cluster

Percentage of swap memory that is free

mem.swap_total_mb
service=system
hostname
cluster

Mbytes of total physical swap memory

mem.swap_used_mb
service=system
hostname
cluster

Mbytes of total swap memory used

mem.total_mb
service=system
hostname
cluster

Total Mbytes of memory

mem.usable_mb
service=system
hostname
cluster

Total Mbytes of usable memory

mem.usable_perc
service=system
hostname
cluster

Percentage of total memory that is usable

mem.used_buffers
service=system
hostname
cluster

Mbytes of memory used by the kernel as buffers for block I/O

mem.used_cache
service=system
hostname
cluster

Mbytes of memory used for the page cache

mem.used_mb
service=system
hostname
cluster

Total Mbytes of used memory

Table 13.12: Network Metrics
Metric Name | Dimensions | Description
net.in_bytes_sec
service=system
hostname
device

Number of network bytes received per second

net.in_errors_sec
service=system
hostname
device

Number of network errors on incoming network traffic per second

net.in_packets_dropped_sec
service=system
hostname
device

Number of inbound network packets dropped per second

net.in_packets_sec
service=system
hostname
device

Number of network packets received per second

net.out_bytes_sec
service=system
hostname
device

Number of network bytes sent per second

net.out_errors_sec
service=system
hostname
device

Number of network errors on outgoing network traffic per second

net.out_packets_dropped_sec
service=system
hostname
device

Number of outbound network packets dropped per second

net.out_packets_sec
service=system
hostname
device

Number of network packets sent per second

13.1.4.21 Zookeeper Metrics

A list of metrics associated with the Zookeeper service.

Metric Name | Dimensions | Description
zookeeper.avg_latency_sec
hostname
mode
service=zookeeper
Average latency in seconds
zookeeper.connections_count
hostname
mode
service=zookeeper
Number of connections
zookeeper.in_bytes
hostname
mode
service=zookeeper
Received bytes
zookeeper.max_latency_sec
hostname
mode
service=zookeeper
Maximum latency in seconds
zookeeper.min_latency_sec
hostname
mode
service=zookeeper
Minimum latency in seconds
zookeeper.node_count
hostname
mode
service=zookeeper
Number of nodes
zookeeper.out_bytes
hostname
mode
service=zookeeper
Sent bytes
zookeeper.outstanding_bytes
hostname
mode
service=zookeeper
Outstanding bytes
zookeeper.zxid_count
hostname
mode
service=zookeeper
Count portion of the ZooKeeper transaction ID (zxid)
zookeeper.zxid_epoch
hostname
mode
service=zookeeper
Epoch portion of the ZooKeeper transaction ID (zxid)

13.2 Centralized Logging Service

You can use the Centralized Logging Service to evaluate and troubleshoot your distributed cloud environment from a single location.

13.2.1 Getting Started with Centralized Logging Service

A typical cloud consists of multiple servers which makes locating a specific log from a single server difficult. The Centralized Logging feature helps the administrator evaluate and troubleshoot the distributed cloud deployment from a single location.

The Logging API is a component in the centralized logging architecture. It works between log producers and log storage. In most cases it works by default after installation with no additional configuration. To use Logging API with logging-as-a-service, you must configure an end-point. This component adds flexibility and supportability for features in the future.

Do I need to configure monasca-log-api? If you are only using the Cloud Lifecycle Manager, then the default configuration is ready to use.

Important

If you are using logging in any of the following deployments, then you will need to query keystone to get an end-point to use.

  • Logging as a Service

  • Platform as a Service

The Logging API is protected by keystone’s role-based access control. To ensure that logging is allowed and monasca alarms can be triggered, the user must have the monasca-user role. To get an end-point from keystone:

  1. Log on to Cloud Lifecycle Manager (deployer node).

  2. To list the Identity service catalog, run:

    ardana > source ./service.osrc
    ardana > openstack catalog list
  3. In the output, find Kronos. For example:

    Name | Type | Endpoints
    kronos | region0 | public: http://myardana.test:5607/v3.0, admin: http://192.168.245.5:5607/v3.0, internal: http://192.168.245.5:5607/v3.0

  4. Use the same port number as found in the output. In the example, you would use port 5607.
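As a minimal sketch, the port can also be read from an endpoint URL programmatically. The function name `endpoint_port` is hypothetical and not part of any SUSE OpenStack Cloud tooling. The URL is the public endpoint from the example output above.

```python
from urllib.parse import urlparse


def endpoint_port(endpoint_url):
    """Return the TCP port of a catalog endpoint URL, or None if none is given."""
    return urlparse(endpoint_url).port


# Public endpoint copied from the example catalog output.
public_endpoint = "http://myardana.test:5607/v3.0"
print(endpoint_port(public_endpoint))
```

Your own catalog may list a different host or port, so read the value from your output instead of relying on the example.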

In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.

Important

It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.

For more information, see Section 13.2.4, “Managing the Centralized Logging Feature”.

13.2.1.1 For More Information

For more information about the centralized logging components, see the following sites:

13.2.2 Understanding the Centralized Logging Service

The Centralized Logging feature collects logs on a central system, rather than leaving the logs scattered across the network. The administrator can use a single Kibana interface to view log information in charts, graphs, tables, histograms, and other forms.

13.2.2.1 What Components are Part of Centralized Logging?

Centralized logging consists of several components, detailed below:

  • Administrator's Browser:  Operations Console can be used to access logging alarms or to access Kibana's dashboards to review logging data.

  • Apache Website for Kibana:  A standard Apache website that proxies web/REST requests to the Kibana NodeJS server.

  • Beaver:  A Python daemon that collects information in log files and sends it to the Logging API (monasca-log-api) over a secure connection.

  • Cloud Auditing Data Federation (CADF):  Defines a standard, full-event model anyone can use to fill in the essential data needed to certify, self-manage and self-audit application security in cloud environments.

  • Centralized Logging and Monitoring (CLM):  Used to evaluate and troubleshoot your SUSE OpenStack Cloud distributed cloud environment from a single location.

  • Curator:  A tool provided by Elasticsearch to manage indices.

  • Elasticsearch:  A data store offering fast indexing and querying.

  • SUSE OpenStack Cloud:  Provides public, private, and managed cloud solutions to get you moving on your cloud journey.

  • JavaScript Object Notation (JSON) log file:  A file stored in the JSON format and used to exchange data. JSON uses JavaScript syntax, but the JSON format is text only. Text can be read and used as a data format by any programming language. This format is used by the Beaver and Logstash components.

  • Kafka:  A messaging broker used for collection of SUSE OpenStack Cloud centralized logging data across nodes. It is highly available, scalable and performant. Kafka stores logs on disk instead of in memory and is therefore more tolerant of consumer downtime.

    Important

    Make sure not to undersize your Kafka partition or the data retention period may be lower than expected. If the Kafka partition capacity is lower than 85%, the retention period will increase to 30 minutes. Over time Kafka will also eject old data.

  • Kibana:  A client/server application with rich dashboards to visualize the data in Elasticsearch through a web browser. Kibana enables you to create charts and graphs using the log data.

  • Logging API (monasca-log-api):  A SUSE OpenStack Cloud API that provides a standard REST interface to store logs. It supports keystone authentication and role-based access control.

  • Logstash:  A log processing system for receiving, processing and outputting logs. Logstash retrieves logs from Kafka, processes and enriches the data, then stores the data in Elasticsearch.

  • MML Service Node:  Metering, Monitoring, and Logging (MML) service node. All services associated with metering, monitoring, and logging run on a dedicated three-node cluster. Three nodes are required for high availability with quorum.

  • Monasca:  OpenStack monitoring at scale infrastructure for the cloud that supports alarms and reporting.

  • OpenStack Service:  An OpenStack service process that requires logging services.

  • Oslo.log:  An OpenStack library for log handling.

  • Platform as a Service (PaaS):  Solutions that automate configuration, deployment and scaling of complete, ready-for-work application platforms. Some PaaS solutions, such as Cloud Foundry, combine operating systems, containers, and orchestrators with developer tools, operations utilities, metrics, and security to create a developer-rich solution.

  • Text log:  A type of file used in the logging process that contains human-readable records.

These components are configured to work out-of-the-box and the admin should be able to view log data using the default configurations.

In addition to each of the services, Centralized Logging also processes logs for the following features:

  • HAProxy

  • Syslog

  • keepalived

The purpose of the logging service is to provide a common logging infrastructure with centralized user access. Since there are numerous services and applications running in each node of a SUSE OpenStack Cloud cloud, and there could be hundreds of nodes, all of these services and applications can generate enough log files to make it very difficult to search for specific events in log files across all of the nodes. Centralized Logging addresses this issue by sending log messages in real time to a central Elasticsearch, Logstash, and Kibana cluster. In this cluster they are indexed and organized for easier and visual searches. The following illustration describes the architecture used to collect operational logs.

[Figure: Centralized Logging architecture for collecting operational logs]
Note

The arrows come from the active (requesting) side to the passive (listening) side. The active side is always the one providing credentials, so the arrows may also be seen as coming from the credential holder to the application requiring authentication.

13.2.2.2 Steps 1-2

Services configured to generate log files record the data. Beaver listens for changes to the files and sends the log files to the Logging Service. The first step the Logging service takes is to re-format the original log file to a new log file with text only and to remove all network operations. In Step 1a, the Logging service uses the Oslo.log library to re-format the file to text-only. In Step 1b, the Logging service uses the Python-Logstash library to format the original audit log file to a JSON file.

Step 1a

Beaver watches configured service operational log files for changes and reads incremental log changes from the files.

Step 1b

Beaver watches configured service audit log files for changes and reads incremental log changes from the files.

Step 2a

The monascalog transport of Beaver makes a token request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.

Step 2b

The monascalog transport of Beaver batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection. Failure logs are written to the local Beaver log.

Step 2c

The REST API client for monasca-log-api makes a token-request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.

Step 2d

The REST API client for monasca-log-api batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection.
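The token caching and batching described in Steps 2a to 2d can be illustrated with a short, self-contained sketch. This is not the Beaver or monasca-log-api client code. `CachedToken`, `batches`, and the `fetch_token` callable are hypothetical names, and the 60-second refresh margin is an assumption chosen for illustration.

```python
import time


class CachedToken:
    """Cache a token until shortly before it expires (hypothetical helper)."""

    def __init__(self, fetch_token, margin_seconds=60):
        # fetch_token is any callable returning (token, expires_at_epoch_seconds).
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self, now=None):
        """Return the cached token, requesting a new one only when needed."""
        now = time.time() if now is None else now
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch_token()
        return self._token


def batches(logs, batch_size):
    """Yield successive lists of at most batch_size log entries."""
    for start in range(0, len(logs), batch_size):
        yield logs[start:start + batch_size]
```

Reusing the cached token means that posting many batches costs one identity request, not one per batch, which is the round-trip saving the steps above describe.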

13.2.2.3 Steps 3a-3b

The Logging API (monasca-log-api) communicates with keystone to validate the incoming request, and then sends the logs to Kafka.

Step 3a

The monasca-log-api WSGI pipeline is configured to validate incoming request tokens with keystone. The keystone middleware used for this purpose is configured to use the monasca-log-api admin user, password and project that have the required keystone role to validate a token.

Step 3b

monasca-log-api sends log messages to Kafka using a language-agnostic TCP protocol.

13.2.2.4 Steps 4-8

Logstash pulls messages from Kafka, identifies the log type, and transforms the messages into either the audit log format or the operational log format. Logstash then sends the messages to Elasticsearch, using either the audit index or the operational index.

Step 4

Logstash input workers pull log messages from the Kafka-Logstash topic using TCP.

Step 5

This Logstash filter processes the log message in-memory in the request pipeline. Logstash identifies the log type from this field.

Step 6

This Logstash filter processes the log message in-memory in the request pipeline. If the message is of audit-log type, Logstash transforms it from the monasca-log-api envelope format to the original CADF format.

Step 7

This Logstash filter determines which index should receive the log message. There are separate indices in Elasticsearch for operational versus audit logs.

Step 8

Logstash output workers write the messages read from Kafka to the daily index in the local Elasticsearch instance.
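The routing in Steps 7 and 8 amounts to choosing an index family by log type and appending the day. The sketch below illustrates that idea only. The prefixes `audit` and `logstash` and the `YYYY.MM.DD` suffix are assumptions made for illustration, not the index names configured by the product, and `target_index` is a hypothetical function.

```python
from datetime import date


def target_index(log_type, day, audit_prefix="audit", operational_prefix="logstash"):
    """Pick a daily index name for a log message.

    The prefixes and the date suffix are illustrative assumptions,
    not the names used by the product.
    """
    prefix = audit_prefix if log_type == "audit" else operational_prefix
    return f"{prefix}-{day:%Y.%m.%d}"


print(target_index("audit", date(2022, 11, 2)))
print(target_index("operational", date(2022, 11, 2)))
```

Keeping one index per day is what lets curator later prune whole days of data by deleting an index instead of individual records.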

13.2.2.5 Steps 9-12

When an administrator who has access to the guest network accesses the Kibana client and makes a request, Apache forwards the request to the Kibana NodeJS server. Then the server uses the Elasticsearch REST API to service the client requests.

Step 9

An administrator who has access to the guest network accesses the Kibana client to view and search log data. The request can originate from the external network in the cloud through a tenant that has a pre-defined access route to the guest network.

Step 10

An administrator who has access to the guest network uses a web browser and points to the Kibana URL. This allows the user to search logs and view Dashboard reports.

Step 11

The authenticated request is forwarded to the Kibana NodeJS server to render the required dashboard, visualization, or search page.

Step 12

The Kibana NodeJS web server uses the Elasticsearch REST API in localhost to service the UI requests.

13.2.2.6 Steps 13-15

Log data is backed up and deleted in the final steps.

Step 13

A daily cron job running in the ELK node runs curator to prune old Elasticsearch log indices.

Step 14

The curator configuration is done at the deployer node through the Ansible role logging-common. Curator is scripted to then prune or clone old indices based on this configuration.

Step 15

The audit logs must be backed up manually. For more information about Backup and Recovery, see Chapter 17, Backup and Restore.

13.2.2.7 How Long are Log Files Retained?

The logs that are centrally stored are saved to persistent storage as Elasticsearch indices. These indices are stored in the partition /var/lib/elasticsearch on each of the Elasticsearch cluster nodes. Out of the box, logs are stored in one Elasticsearch index per service. As more days go by, the number of indices stored in this disk partition grows. Eventually the partition fills up. If they are open, each of these indices takes up CPU and memory. If these indices are left unattended they will continue to consume system resources and eventually deplete them.

Elasticsearch, by itself, does not prevent this from happening.

SUSE OpenStack Cloud uses a tool called curator that is developed by the Elasticsearch community to handle these situations. SUSE OpenStack Cloud installs and uses a curator in conjunction with several configurable settings. This curator is called by cron and performs the following checks:

  • First Check. The hourly cron job checks to see if the currently used Elasticsearch partition size is over the value set in:

    curator_low_watermark_percent

    If it is higher than this value, the curator deletes old indices according to the value set in:

    curator_num_of_indices_to_keep
  • Second Check. Another check verifies whether the partition size is now below the high watermark percent. If it is still too high, curator deletes all indices, except the current one, that are over the size set in:

    curator_max_index_size_in_gb
  • Third Check. A third check verifies if the partition size is still too high. If it is, curator will delete all indices except the current one.

  • Final Check. A final check verifies if the partition size is still high. If it is, an error message is written to the log file but the current index is NOT deleted.
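The four checks above can be read as a cascade. The following sketch simulates that cascade on a list of index sizes. It is not the curator script shipped with the product. The function names are hypothetical, the parameters `low_pct`, `keep`, and `max_index_gb` stand in for `curator_low_watermark_percent`, `curator_num_of_indices_to_keep`, and `curator_max_index_size_in_gb`, and `high_pct` stands in for the high watermark setting, whose variable name is not given in this section.

```python
def used_percent(indices, capacity_gb):
    """Percentage of the partition consumed by the given indices."""
    return 100.0 * sum(size_gb for _, size_gb in indices) / capacity_gb


def prune(indices, capacity_gb, low_pct, high_pct, keep, max_index_gb):
    """Simulate the four checks.

    indices is a list of (name, size_gb) tuples, oldest first; the last
    entry is the current index. Returns (remaining_indices, error_logged).
    """
    remaining = list(indices)
    if not remaining:
        return remaining, False
    # First check: over the low watermark, so keep only the newest `keep` indices.
    if used_percent(remaining, capacity_gb) > low_pct:
        remaining = remaining[-max(keep, 1):]
    # Second check: still over the high watermark, so drop oversized
    # indices, but never the current one.
    if used_percent(remaining, capacity_gb) > high_pct:
        older, current = remaining[:-1], remaining[-1]
        remaining = [ix for ix in older if ix[1] <= max_index_gb] + [current]
    # Third check: still too high, so drop everything except the current index.
    if used_percent(remaining, capacity_gb) > high_pct:
        remaining = remaining[-1:]
    # Final check: only report an error; the current index is kept.
    error_logged = used_percent(remaining, capacity_gb) > high_pct
    return remaining, error_logged
```

Each stage is more aggressive than the one before it, and the current index survives every stage, so new log data always has somewhere to go.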

In the case of an extreme network issue, log files can run out of disk space in under an hour. To avoid this SUSE OpenStack Cloud uses a shell script called logrotate_if_needed.sh. The cron process runs this script every 5 minutes to see if the size of /var/log has exceeded the high_watermark_percent (95% of the disk, by default). If it is at or above this level, logrotate_if_needed.sh runs the logrotate script to rotate logs and to free up extra space. This script helps to minimize the chance of running out of disk space on /var/log.
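The watermark test itself is simple arithmetic. The sketch below shows the comparison in Python using the standard library. It is not the contents of logrotate_if_needed.sh. The function names are hypothetical, and the 95% default is taken from the paragraph above.

```python
import shutil


def percent_used(path):
    """Return how full the file system holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_rotation(used_pct, high_watermark_percent=95.0):
    """Rotation is triggered when usage is at or above the high watermark."""
    return used_pct >= high_watermark_percent


# Example: check the file system that holds the current directory.
print(needs_rotation(percent_used(".")))
```

On a cloud node the path of interest would be /var/log; the current directory is used here only so that the example runs anywhere.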

13.2.2.8 How Are Logs Rotated?

SUSE OpenStack Cloud uses the cron process which in turn calls Logrotate to provide rotation, compression, and removal of log files. Each log file can be rotated hourly, daily, weekly, or monthly. If no rotation period is set then the log file will only be rotated when it grows too large.

Rotating a file means that the Logrotate process creates a copy of the log file with a new extension, for example, the .1 extension, then empties the contents of the original file. If a .1 file already exists, then that file is first renamed with a .2 extension. If a .2 file already exists, it is renamed to .3, etc., up to the maximum number of rotated files specified in the settings file. When Logrotate reaches the last possible file extension, it will delete the last file first on the next rotation. By the time the Logrotate process needs to delete a file, the results will have been copied to Elasticsearch, the central logging database.
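The renaming scheme can be sketched in a few lines of Python. This illustrates the copy-then-empty behavior described above and is not how Logrotate is implemented. `rotate` and `max_rotations` are hypothetical names, with `max_rotations` standing in for the maximum number of rotated files from the settings file.

```python
import os
import shutil


def rotate(path, max_rotations):
    """Rotate `path`, keeping at most max_rotations numbered copies."""
    # The oldest copy is deleted first to make room.
    oldest = f"{path}.{max_rotations}"
    if os.path.exists(oldest):
        os.remove(oldest)
    # Shift the remaining copies up by one: .2 -> .3, then .1 -> .2.
    for n in range(max_rotations - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{n + 1}")
    # Copy the live log to .1, then empty the original file.
    shutil.copy2(path, f"{path}.1")
    with open(path, "w"):
        pass
```

Emptying the original file in place, instead of renaming it, lets a service that holds the file open keep writing to the same file.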

The log rotation settings files can be found in the following directory:

~/scratch/ansible/next/ardana/ansible/roles/logging-common/vars

These files allow you to set the following options:

Service

The name of the service that creates the log entries.

Rotated Log Files

List of log files to be rotated. These files are kept locally on the server and will continue to be rotated. If the file is also listed as Centrally Logged, it will also be copied to Elasticsearch.

Frequency

The timing of when the logs are rotated. Options include: hourly, daily, weekly, or monthly.

Max Size

The maximum file size the log can be before it is rotated out.

Rotation

The number of log files that are rotated.

Centrally Logged Files

These files will be indexed by Elasticsearch and will be available for searching in the Kibana user interface.

Only files that are listed in the Centrally Logged Files section are copied to Elasticsearch.

All of the variables for the Logrotate process are found in the following file:

~/scratch/ansible/next/ardana/ansible/roles/logging-ansible/logging-common/defaults/main.yml

Cron runs Logrotate hourly. In addition, every 5 minutes cron runs the logrotate_if_needed.sh script, which uses a watermark value to determine whether Logrotate needs to run. If the high watermark has been reached, meaning the /var/log partition is at least 95% full (the default, which can be adjusted), Logrotate runs within 5 minutes.

13.2.2.9 Are Log Files Backed-Up To Elasticsearch?

While centralized logging is enabled out of the box, the backup of these logs is not. This is because Centralized Logging relies on the Elasticsearch FileSystem Repository plugin, which in turn requires shared disk partitions to be configured and accessible from each of the Elasticsearch nodes. Since there are multiple ways to set up a shared disk partition, SUSE OpenStack Cloud allows you to choose an approach that works best for your deployment before enabling the back-up of log files to Elasticsearch.

If you enable automatic back-up of centralized log files, then all the logs collected from the cloud nodes will be backed up to Elasticsearch. Every hour, on the management controller nodes where Elasticsearch is set up, a cron job checks whether Elasticsearch is running low on disk space. If space is low, the job then checks whether the backup feature is enabled. If it is enabled, the cron job uses curator to save a snapshot of the Elasticsearch indices to the configured shared disk partition. Next, the script deletes the oldest index and works forward from there, checking each time whether Elasticsearch now has enough space. A check is also made to ensure that the backup runs only once a day.

For steps on how to enable automatic back-up, see Section 13.2.5, “Configuring Centralized Logging”.

13.2.3 Accessing Log Data

All logging data in SUSE OpenStack Cloud is managed by the Centralized Logging Service and can be viewed or analyzed by Kibana. Kibana is the only graphical interface provided with SUSE OpenStack Cloud to search or create a report from log data. Operations Console provides only a link to the Kibana Logging dashboard.

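The following two methods allow you to access the Kibana Logging dashboard to search log data:

  • Section 13.2.3.1, “Use the Operations Console Link”

  • Section 13.2.3.2, “Using Kibana to Access Log Data”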

To learn more about Kibana, read the Getting Started with Kibana guide.

13.2.3.1 Use the Operations Console Link

Operations Console allows you to access Kibana in the same tool that you use to manage the other SUSE OpenStack Cloud resources in your deployment. To use Operations Console, you must have the correct permissions.

To use Operations Console:

  1. In a browser, open the Operations Console.

  2. On the login page, enter the user name and the password, and then click LOG IN.

  3. On the Home/Central Dashboard page, click the menu represented by 3 horizontal lines (Three-Line Icon).

  4. From the menu that slides in on the left, select Home, and then select Logging.

  5. On the Home/Logging page, click View Logging Dashboard.

Important

In SUSE OpenStack Cloud, Kibana usually runs on a different network than Operations Console. Due to this configuration, using Operations Console to access Kibana may result in a “404 not found” error. This error only occurs if the user has access only to the public-facing network.

13.2.3.2 Using Kibana to Access Log Data

Kibana is an open-source, data-visualization plugin for Elasticsearch. Kibana provides visualization capabilities using the log content indexed on an Elasticsearch cluster. Users can create bar and pie charts, line and scatter plots, and maps using the data collected by SUSE OpenStack Cloud in the cloud log files.

While creating Kibana dashboards is beyond the scope of this document, it is important to know that the dashboards you create are JSON files. You can modify them, or create new dashboards based on existing ones.

Note

Kibana is client-server software. To operate properly, the browser must be able to access port 5601 on the control plane.
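If the Kibana page does not load, a quick connectivity test from the machine running the browser can rule out a blocked port. The following is a minimal sketch using only the Python standard library. `port_is_reachable` is a hypothetical helper, and the host name in the usage comment is a placeholder for your own control plane address.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (placeholder host name, replace with your control plane address):
# port_is_reachable("control-plane.example", 5601)
```

A result of False only shows that the TCP connection failed. It does not distinguish between a firewall rule, a routing problem, and a stopped service.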

Field | Default Value | Description
user | kibana | Username that will be required for logging into the Kibana UI.
password | random password is generated | Password generated during installation that is used to log in to the Kibana UI.

13.2.3.3 Logging into Kibana

To log into Kibana to view data, you must make sure you have the required login configuration.

13.2.3.3.1 Verify Login Credentials

During the installation of Kibana, a password is automatically set and it is randomized. Therefore, unless an administrator has already changed it, you need to retrieve the default password from a file on the control plane node.

13.2.3.3.2 Find the Randomized Password
  1. To find the Kibana password, run:

    ardana > grep kibana ~/scratch/ansible/next/my_cloud/stage/internal/CloudModel.yaml

13.2.4 Managing the Centralized Logging Feature

No specific configuration tasks are required to use Centralized Logging, as it is enabled by default after installation. However, you can configure the individual components as needed for your environment.

13.2.4.1 How Do I Stop and Start the Logging Service?

Although you might not need to stop and start the logging service very often, you may need to if, for example, one of the logging services is not behaving as expected or not working.

You cannot enable or disable centralized logging across all services unless you stop all centralized logging. Instead, it is recommended that you enable or disable individual log files in the <service>-clr.yml files and then reconfigure logging. You would enable centralized logging for a file when you want to make sure you are able to monitor those logs in Kibana.

In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.

Important

It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.

The steps in this section only impact centralized logging. Logrotate is an essential feature that keeps the service log files from filling the disk and will not be affected.

Important

These playbooks must be run from the Cloud Lifecycle Manager.

To stop the Logging service:

  1. To change to the directory containing the ansible playbook, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To run the ansible playbook that will stop the logging service, run:

    ardana > ansible-playbook -i hosts/verb_hosts logging-stop.yml

To start the Logging service:

  1. To change to the directory containing the ansible playbook, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To run the ansible playbook that will start the logging service, run:

    ardana > ansible-playbook -i hosts/verb_hosts logging-start.yml

13.2.4.2 How Do I Enable or Disable Centralized Logging For a Service?

To enable or disable Centralized Logging for a service you need to modify the configuration for the service, set the enabled flag to true or false, and then reconfigure logging.

Important

There are consequences if you enable too many logging files for a service. If there is not enough storage to support the increased logging, the retention period of logs in Elasticsearch is decreased. Conversely, if you want to increase the retention period of log files, or if you do not want certain logs to show up in Kibana, disable centralized logging for those files.

To enable Centralized Logging for a service:

  1. Use the documentation provided with the service to ensure it is not configured for logging.

  2. To find the SUSE OpenStack Cloud file to edit, run:

    ardana > find ~/openstack/my_cloud/config/logging/vars/ -name "*service-name*"
  3. Edit the file for the service for which you want to enable logging.

  4. Find the following code. To enable Centralized Logging, change the enabled flag to true. To disable it, change the enabled flag to false:

    logging_options:
     - centralized_logging:
            enabled: true
            format: json
  5. Save the changes to the file.

  6. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  7. To reconfigure logging, run:

    ardana > cd ~/openstack/ardana/ansible/
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
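If you need to toggle the flag for several entries, the edit from step 4 can be expressed as a small function over the parsed configuration. This is a sketch under assumptions: it operates on a Python structure mirroring the YAML fragment in step 4, such as a YAML parser would produce, and `set_centralized_logging` is a hypothetical helper, not a tool shipped with the product.

```python
def set_centralized_logging(logging_options, enabled):
    """Set the enabled flag on every centralized_logging entry.

    Returns the number of entries whose value actually changed.
    """
    changed = 0
    for option in logging_options:
        section = option.get("centralized_logging")
        if section is not None and section.get("enabled") != enabled:
            section["enabled"] = enabled
            changed += 1
    return changed


# Structure mirroring the YAML fragment shown in step 4.
options = [{"centralized_logging": {"enabled": True, "format": "json"}}]
print(set_centralized_logging(options, False))
```

The change only takes effect after you commit it and reconfigure logging as described in the procedure above.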

13.2.5 Configuring Centralized Logging

You can adjust the settings for centralized logging when you are troubleshooting problems with a service or to decrease log size and retention to save on disk space. For steps on how to configure logging settings, refer to the following tasks:

13.2.5.1 Configuration Files

Centralized Logging settings are stored in the configuration files in the following directory on the Cloud Lifecycle Manager: ~/openstack/my_cloud/config/logging/

The configuration files and their use are described below:

File | Description
main.yml | Main configuration file for all centralized logging components.
elasticsearch.yml.j2 | Main configuration file for Elasticsearch.
elasticsearch-default.j2 | Default overrides for the Elasticsearch init script.
kibana.yml.j2 | Main configuration file for Kibana.
kibana-apache2.conf.j2 | Apache configuration file for Kibana.
logstash.conf.j2 | Logstash inputs/outputs configuration.
logstash-default.j2 | Default overrides for the Logstash init script.
beaver.conf.j2 | Main configuration file for Beaver.
vars | Path to logrotate configuration files.

13.2.5.2 Planning Resource Requirements

The Centralized Logging service needs to have enough resources available to it to perform adequately for different scale environments. The base logging levels are tuned during installation according to the amount of RAM allocated to your control plane nodes to ensure optimum performance.

These values can be viewed and changed in the ~/openstack/my_cloud/config/logging/main.yml file, but you will need to run a reconfigure of the Centralized Logging service if changes are made.

Warning

The total process memory consumption for Elasticsearch will be the above allocated heap value (in ~/openstack/my_cloud/config/logging/main.yml) plus any Java Virtual Machine (JVM) overhead.

Setting Disk Size Requirements

In the entry-scale models, the disk partition sizes on your controller nodes for the logging and Elasticsearch data are set as a percentage of your total disk size. You can see these in the following file on the Cloud Lifecycle Manager (deployer): ~/openstack/my_cloud/definition/data/<controller_disk_files_used>

Sample file settings:

# Local Log files.
- name: log
  size: 13%
  mount: /var/log
  fstype: ext4
  mkfs-opts: -O large_file

# Data storage for centralized logging. This holds log entries from all
# servers in the cloud and hence can require a lot of disk space.
- name: elasticsearch
  size: 30%
  mount: /var/lib/elasticsearch
  fstype: ext4
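To get a feel for what these percentages mean in absolute terms, the arithmetic can be sketched as follows. The 1000 GB total is a hypothetical disk size chosen for illustration, `partition_size_gb` is a hypothetical helper, and the percentages are the ones from the sample settings above.

```python
def partition_size_gb(total_disk_gb, percent):
    """Size of a partition defined as a percentage of the total disk size."""
    return total_disk_gb * percent / 100.0


total_gb = 1000  # hypothetical total disk size
print(partition_size_gb(total_gb, 13))  # /var/log
print(partition_size_gb(total_gb, 30))  # /var/lib/elasticsearch
```

Running the same calculation with your actual disk size shows how much room Elasticsearch has before curator starts pruning indices.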
Important

The disk size is set automatically based on the hardware configuration. If you need to adjust it, you can set it manually with the following steps.

To set disk sizes:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/definition/data/disks.yml
  3. Make any desired changes.

  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the logging reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
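The commit, configuration processor, deployment directory, and reconfigure steps recur throughout this chapter with only the final playbook changing. The following sketch collects them in one place. It only prints the commands; the function name reconfigure_steps is hypothetical, and the directories and playbook names are the ones used in the steps above.

```python
import shlex

SOURCE_DIR = "~/openstack/ardana/ansible"
DEPLOY_DIR = "~/scratch/ansible/next/ardana/ansible"


def reconfigure_steps(playbook, commit_message="My config or other commit message"):
    """Return (directory, command) pairs for the commit and reconfigure sequence."""
    return [
        (SOURCE_DIR, ["git", "add", "-A"]),
        (SOURCE_DIR, ["git", "commit", "-m", commit_message]),
        (SOURCE_DIR, ["ansible-playbook", "-i", "hosts/localhost", "config-processor-run.yml"]),
        (SOURCE_DIR, ["ansible-playbook", "-i", "hosts/localhost", "ready-deployment.yml"]),
        (DEPLOY_DIR, ["ansible-playbook", "-i", "hosts/verb_hosts", playbook]),
    ]


for directory, command in reconfigure_steps("kronos-reconfigure.yml"):
    # Print only; nothing is executed by this sketch.
    print(f"cd {directory} && {shlex.join(command)}")
```

Printing rather than executing keeps the sketch safe to run anywhere; on the Cloud Lifecycle Manager you would still run each command by hand and check its result before moving on.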

13.2.5.3 Backing Up Elasticsearch Log Indices

The log files that are centrally collected in SUSE OpenStack Cloud are stored by Elasticsearch on disk in the /var/lib/elasticsearch partition, distributed as shards across the Elasticsearch cluster nodes. A cron job runs periodically to check whether the disk partition is running low on space and, if so, runs curator to delete old log indices to make room for new logs. This deletion is permanent; the deleted logs cannot be recovered. If you want to back up old logs, for example to comply with certain regulations, you can configure automatic backup of Elasticsearch indices.

Important

If you need to restore data that was archived prior to SUSE OpenStack Cloud 9 and used the older versions of Elasticsearch, then this data will need to be restored to a separate deployment of Elasticsearch.

This can be accomplished using the following steps:

  1. Deploy a separate Elasticsearch instance whose version matches the version in SUSE OpenStack Cloud.

  2. Configure the backed-up data using NFS or some other share mechanism to be available to the Elasticsearch instance matching the version in SUSE OpenStack Cloud.

Before enabling automatic backups, make sure you understand how much disk space you will need, and configure the disks that will store the data. Use the following checklist to prepare your deployment for enabling automatic backups:

  • Add a shared disk partition to each of the Elasticsearch controller nodes.

    The default partition name used for backup is /var/lib/esbackup. To change it:

      1. Open the following file: my_cloud/config/logging/main.yml

      2. Edit the following variable: curator_es_backup_partition

  • Ensure the shared disk has enough storage to retain backups for the desired retention period.

To enable automatic backup of the Elasticsearch log indices:

  1. Log in to the Cloud Lifecycle Manager (deployer node).

  2. Open the following file in a text editor:

    ~/openstack/my_cloud/config/logging/main.yml
  3. Find the following variables:

    curator_backup_repo_name: "es_{{host.my_dimensions.cloud_name}}"
    curator_es_backup_partition: /var/lib/esbackup
  4. To enable backup, change the curator_enable_backup value to true in the curator section:

    curator_enable_backup: true
  5. Save your changes and re-run the configuration processor:

    ardana > cd ~/openstack
    ardana > git add -A
    # Verify the added files
    ardana > git status
    ardana > git commit -m "Enabling Elasticsearch Backup"
    
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. To re-configure logging:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
  7. To verify that the indices are backed up, check the contents of the partition:

    ardana > ls /var/lib/esbackup
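If you want to check the backup location from a script rather than with ls, a minimal sketch could look like the following. It assumes the default /var/lib/esbackup path and that the script runs on a node where that partition is mounted and readable; the function name backup_summary is hypothetical.

```python
import os
import shutil

BACKUP_PATH = "/var/lib/esbackup"  # default value of curator_es_backup_partition


def backup_summary(path=BACKUP_PATH):
    """Return the sorted directory entries and the free space in bytes."""
    entries = sorted(os.listdir(path))
    free_bytes = shutil.disk_usage(path).free
    return entries, free_bytes


# Only report when the partition is present and readable on this host.
if os.path.isdir(BACKUP_PATH) and os.access(BACKUP_PATH, os.R_OK | os.X_OK):
    entries, free_bytes = backup_summary()
    print(f"{len(entries)} entries, {free_bytes / 1024**3:.1f} GiB free")
```

An empty listing after a backup run would suggest that backup is not enabled or that the partition is not shared correctly.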

13.2.5.4 Restoring Logs From an Elasticsearch Backup

To restore logs from an Elasticsearch backup, see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-snapshots.html.

Note

We do not recommend restoring to the original SUSE OpenStack Cloud Centralized Logging cluster as it may cause storage and capacity issues. Instead, we recommend setting up a separate ELK cluster of the same version and restoring the logs there.
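The Elasticsearch documentation linked above describes restoring a snapshot with a POST request to the _snapshot/<repository>/<snapshot>/_restore endpoint. The sketch below only builds such a request and does not send it. The base URL, the repository name es_mycloud, and the snapshot name snapshot_1 are placeholders; it also assumes the backup repository has already been registered on the separate cluster.

```python
from urllib.request import Request


def restore_request(base_url, repository, snapshot):
    """Build (but do not send) a snapshot restore request."""
    url = f"{base_url}/_snapshot/{repository}/{snapshot}/_restore"
    return Request(url, method="POST")


# All three arguments are placeholders for illustration.
request = restore_request("http://localhost:9200", "es_mycloud", "snapshot_1")
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) should only be done against the separate restore cluster, never against the production logging cluster.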

13.2.5.5 Tuning Logging Parameters

When centralized logging is installed in SUSE OpenStack Cloud, the Elasticsearch and logstash heap sizes are automatically configured based on the amount of RAM on the system. These defaults are usually appropriate, but they may need to be adjusted if performance or disk space issues arise, or if hardware changes are made after installation.

These values are defined at the top of the following file .../logging-common/defaults/main.yml. An example of the contents of the file is below:

# Select heap tunings based on system RAM
#-------------------------------------------------------------------------------
threshold_small_mb: 31000
threshold_medium_mb: 63000
threshold_large_mb: 127000
tuning_selector: " {% if ansible_memtotal_mb < threshold_small_mb|int %}
  demo
  {% elif ansible_memtotal_mb < threshold_medium_mb|int %}
  small
  {% elif ansible_memtotal_mb < threshold_large_mb|int %}
  medium
  {% else %}
  large
  {%endif %}
  "

logging_possible_tunings:
  # RAM < 32GB
  demo:
    elasticsearch_heap_size: 512m
    logstash_heap_size: 512m
  # RAM < 64GB
  small:
    elasticsearch_heap_size: 8g
    logstash_heap_size: 2g
  # RAM < 128GB
  medium:
    elasticsearch_heap_size: 16g
    logstash_heap_size: 4g
  # RAM >= 128GB
  large:
    elasticsearch_heap_size: 31g
    logstash_heap_size: 8g
logging_tunings: "{{ logging_possible_tunings[tuning_selector] }}"

These settings define memory thresholds for what counts as a demo, small, medium, or large system. To see which values your installation will use, check how much RAM your system has and find the threshold range it falls into. To modify the values, you can either adjust the threshold values so that your system moves from, for example, a small configuration to a medium configuration, or keep the thresholds the same and modify the heap_size variables directly for the selector that applies to your system. For example, if your system uses the medium configuration, which sets the heap size to 16 GB for Elasticsearch and 4 GB for logstash, and you want twice as much memory set aside for logstash, increase the logstash value from 4 GB to 8 GB.
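The selection logic can be restated as a short Python sketch. It is not part of the product; the function name select_tuning and the 96000 MB sample value are assumptions chosen for illustration, while the thresholds and heap sizes are copied from the example file.

```python
THRESHOLDS_MB = {"small": 31000, "medium": 63000, "large": 127000}

TUNINGS = {
    "demo": {"elasticsearch_heap_size": "512m", "logstash_heap_size": "512m"},
    "small": {"elasticsearch_heap_size": "8g", "logstash_heap_size": "2g"},
    "medium": {"elasticsearch_heap_size": "16g", "logstash_heap_size": "4g"},
    "large": {"elasticsearch_heap_size": "31g", "logstash_heap_size": "8g"},
}


def select_tuning(memtotal_mb):
    """Mirror the if/elif chain of tuning_selector for a given RAM size in MB."""
    if memtotal_mb < THRESHOLDS_MB["small"]:
        return "demo"
    if memtotal_mb < THRESHOLDS_MB["medium"]:
        return "small"
    if memtotal_mb < THRESHOLDS_MB["large"]:
        return "medium"
    return "large"


selector = select_tuning(96000)  # hypothetical node reporting about 96 GB of RAM
print(selector, TUNINGS[selector])
```

Lowering threshold_medium_mb moves borderline systems into a larger tier, whereas editing a heap size changes the values for every system already in that tier.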

13.2.6 Configuring Settings for Other Services

When you configure settings for the Centralized Logging service, those changes impact all services that are enabled for centralized logging. However, if you only need to change the logging configuration for one specific service, modify that service's files instead of changing the settings for the entire Centralized Logging service. The following sections describe how to do this for each service.

13.2.6.1 Setting Logging Levels for Services

When it is necessary to increase the logging level for a specific service to troubleshoot an issue, or to decrease logging levels to save disk space, you can edit the service's config file and then reconfigure logging. All changes will be made to the service's files and not to the Centralized Logging service files.

Messages only appear in the log files if they are the same as or more severe than the log level you set. The DEBUG level logs everything. Most services default to the INFO logging level, which lists informational events, plus warnings, errors, and critical errors. Some services provide other logging levels that narrow the focus, such as logging only warnings about failed operations or only errors that indicate a serious issue with the cloud.
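The threshold behavior can be demonstrated with Python's standard logging module. This is a minimal sketch: the ListHandler class and the emitted_levels function are written for this example and are not part of any service.

```python
import logging


class ListHandler(logging.Handler):
    """Collect the level names of all records that pass the logger's threshold."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def emit(self, record):
        self.levels.append(record.levelname)


def emitted_levels(threshold):
    """Log one message per level and return the levels that were actually emitted."""
    logger = logging.getLogger(f"loglevel-demo.{threshold}")
    logger.setLevel(threshold)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    for level in (logging.DEBUG, logging.INFO, logging.WARNING,
                  logging.ERROR, logging.CRITICAL):
        logger.log(level, "sample message")
    logger.removeHandler(handler)
    return handler.levels


print(emitted_levels("INFO"))   # DEBUG messages are dropped
print(emitted_levels("DEBUG"))  # every message is kept
```

Raising the threshold to WARNING or ERROR would shrink the list further, which is the effect used to save disk space.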

For more information on logging levels, see the OpenStack Logging Guidelines documentation.

13.2.6.2 Configuring the Logging Level for a Service

If you want to increase or decrease the amount of detail that is logged by a service, you can change the current logging level in the configuration files. Most services support, at a minimum, the DEBUG and INFO logging levels. For more information about what levels are supported by a service, check the documentation or website for the specific service.

13.2.6.3 Barbican

Service: barbican
Sub-components: barbican-api, barbican-worker
Supported logging levels: INFO (default), DEBUG

To change the barbican logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/
    ardana > vi my_cloud/config/barbican/barbican_deploy_config.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    barbican_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    barbican_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts barbican-reconfigure.yml
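Because the level names must be written in ALL CAPS and must be one of the supported levels, it can help to check the values before editing the file. The following sketch is one way to do that; the function name check_loglevel is hypothetical, and the supported levels are taken from the barbican table above.

```python
def check_loglevel(value, supported=("INFO", "DEBUG")):
    """Return value if it is an upper-case supported level, else raise ValueError."""
    if value != value.upper():
        raise ValueError(f"log level must be in ALL CAPS: {value!r}")
    if value not in supported:
        raise ValueError(f"unsupported log level {value!r}; expected one of {supported}")
    return value


# The on-disk log and the JSON entries sent to centralized logging
# have separate thresholds, so the two values may differ.
new_settings = {
    "barbican_loglevel": check_loglevel("DEBUG"),
    "barbican_logstash_loglevel": check_loglevel("INFO"),
}
print(new_settings)
```

In this example the local log file becomes more verbose for troubleshooting while the volume of data forwarded to centralized logging stays unchanged.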

13.2.6.4 Block Storage (cinder)

Service: cinder
Sub-components: cinder-api, cinder-scheduler, cinder-backup, cinder-volume
Supported logging levels: INFO (default), DEBUG

To manage cinder logging:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/ardana/ansible
    ardana > vi roles/_CND-CMN/defaults/main.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    cinder_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    cinder_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml

13.2.6.5 Ceilometer

Service: ceilometer
Sub-components: ceilometer-collector, ceilometer-agent-notification, ceilometer-polling, ceilometer-expirer
Supported logging levels: INFO (default), DEBUG

To change the ceilometer logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/ardana/ansible
    ardana > vi roles/_CEI-CMN/defaults/main.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    ceilometer_loglevel:  INFO
    ceilometer_logstash_loglevel:  INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml

13.2.6.6 Compute (nova)

Service: nova
Supported logging levels: INFO (default), DEBUG

To change the nova logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. The nova service component logging can be changed by modifying the following files:

    ~/openstack/my_cloud/config/nova/novncproxy-logging.conf.j2
    ~/openstack/my_cloud/config/nova/api-logging.conf.j2
    ~/openstack/my_cloud/config/nova/compute-logging.conf.j2
    ~/openstack/my_cloud/config/nova/conductor-logging.conf.j2
    ~/openstack/my_cloud/config/nova/scheduler-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
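The two handler sections carry independent thresholds. The following sketch illustrates this with Python's configparser on a reduced, hypothetical fragment that contains only the level lines. The real files are Jinja2 templates with more settings and are meant to be edited in a text editor, so treat this purely as an illustration of the structure.

```python
import configparser
import io

# Hypothetical fragment: only the section names and the level key come from the text.
SAMPLE = """\
[handler_watchedfile]
level: INFO

[handler_logstash]
level: INFO
"""


def set_handler_level(text, section, level):
    """Return the configuration text with the level of one handler section changed."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    parser[section]["level"] = level
    output = io.StringIO()
    parser.write(output)
    return output.getvalue()


updated = set_handler_level(SAMPLE, "handler_watchedfile", "DEBUG")
print(updated)
```

Note that configparser writes "level = DEBUG" rather than "level: DEBUG"; both delimiters are accepted when the file is read back. Only the text log threshold changes here, while the content forwarded to centralized logging stays at INFO.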

13.2.6.7 Designate (DNS)

Service: designate
Sub-components: designate-api, designate-central, designate-mdns, designate-producer, designate-worker, designate-pool-manager, designate-zone-manager
Supported logging levels: INFO (default), DEBUG

To change the designate logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/
    ardana > vi my_cloud/config/designate/designate.conf.j2
  3. To change the logging level, set the value of the following line (True enables DEBUG logging, False keeps the default INFO level):

    debug = False
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-reconfigure.yml

13.2.6.8 Identity (keystone)

Service: keystone
Sub-component: keystone
Supported logging levels: INFO (default), DEBUG, WARN, ERROR

To change the keystone logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    keystone_loglevel: INFO
    keystone_logstash_loglevel: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

13.2.6.9 Image (glance)

Service: glance
Sub-component: glance-api
Supported logging levels: INFO (default), DEBUG

To change the glance logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/config/glance/glance-api-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml

13.2.6.10 Bare Metal (ironic)

Service: ironic
Sub-components: ironic-api-logging.conf.j2, ironic-conductor-logging.conf.j2
Supported logging levels: INFO (default), DEBUG

To change the ironic logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/ardana/ansible
    ardana > vi roles/ironic-common/defaults/main.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    ironic_api_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    ironic_api_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    ironic_conductor_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    ironic_conductor_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ironic-reconfigure.yml

13.2.6.11 Monitoring (monasca)

Service: monasca
Sub-components: monasca-persister, zookeeper, storm, monasca-notification, monasca-api, kafka, monasca-agent
Supported logging levels: WARN (default), INFO

To change the monasca logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Monitoring service component logging can be changed by modifying the following files:

    ~/openstack/ardana/ansible/roles/monasca-persister/defaults/main.yml
    ~/openstack/ardana/ansible/roles/storm/defaults/main.yml
    ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
    ~/openstack/ardana/ansible/roles/monasca-api/defaults/main.yml
    ~/openstack/ardana/ansible/roles/kafka/defaults/main.yml
    ~/openstack/ardana/ansible/roles/monasca-agent/defaults/main.yml (For this file, you will need to add the variable)
  3. To change the logging level, use ALL CAPS to set the desired level in the following line:

    monasca_log_level: WARN
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml

13.2.6.12 Networking (neutron)

Service: neutron
Sub-components: neutron-server, dhcp-agent, l3-agent, metadata-agent, openvswitch-agent, ovsvapp-agent, sriov-agent, infoblox-ipam-agent, l2gateway-agent
Supported logging levels: INFO (default), DEBUG

To change the neutron logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. The neutron service component logging can be changed by modifying the following files:

    ~/openstack/ardana/ansible/roles/neutron-common/templates/dhcp-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/infoblox-ipam-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/l2gateway-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/l3-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/metadata-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/openvswitch-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/sriov-agent-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

13.2.6.13 Object Storage (swift)

Service: swift
Supported logging levels: INFO (default), DEBUG

Note

Currently it is not recommended to log at any level other than INFO.

13.2.6.14 Octavia

Service: octavia
Sub-components: octavia-api, octavia-worker, octavia-hk, octavia-hm
Supported logging levels: INFO (default), DEBUG

To change the Octavia logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. The Octavia service component logging can be changed by modifying the following files:

    ~/openstack/my_cloud/config/octavia/octavia-api-logging.conf.j2
    ~/openstack/my_cloud/config/octavia/octavia-worker-logging.conf.j2
    ~/openstack/my_cloud/config/octavia/octavia-hk-logging.conf.j2
    ~/openstack/my_cloud/config/octavia/octavia-hm-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml

13.2.6.15 Operations Console

Service: opsconsole
Sub-components: ops-web, ops-mon
Supported logging levels: INFO (default), DEBUG

To change the Operations Console logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. Open the following file:

    ~/openstack/ardana/ansible/roles/OPS-WEB/defaults/main.yml
  3. To change the logging level, use ALL CAPS to set the desired level in the following line:

    ops_console_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  5. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ops-console-reconfigure.yml

13.2.6.16 Orchestration (heat)

Service: heat
Sub-components: api-cfn, api, engine
Supported logging levels: INFO (default), DEBUG

To change the heat logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the logging file for the sub-component you want to change:

    ~/openstack/my_cloud/config/heat/*-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml

13.2.6.17 Magnum

Service: magnum
Sub-components: api, conductor
Supported logging levels: INFO (default), DEBUG

To change the Magnum logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open one or both of the following files:

    ~/openstack/my_cloud/config/magnum/api-logging.conf.j2
    ~/openstack/my_cloud/config/magnum/conductor-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts magnum-reconfigure.yml

13.2.6.18 File Storage (manila)

Service: manila
Sub-component: api
Supported logging levels: INFO (default), DEBUG

To change the manila logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/config/manila/manila-logging.conf.j2
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    manila_loglevel: INFO
    manila_logstash_loglevel: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts manila-reconfigure.yml

13.2.6.19 Selecting Files for Centralized Logging

As you use SUSE OpenStack Cloud, you might find a need to redefine which log files are rotated on disk or transferred to centralized logging. These changes are all made in the centralized logging definition files.

SUSE OpenStack Cloud uses the logrotate service to provide rotation, compression, and removal of log files. All of the tunable variables for the logrotate process itself can be controlled in the following file: ~/openstack/ardana/ansible/roles/logging-common/defaults/main.yml

You can find the centralized logging definition files for each service in the following directory: ~/openstack/ardana/ansible/roles/logging-common/vars

You can change log settings for a service by following these steps.

  1. Log in to the Cloud Lifecycle Manager.

    Open the *.yml file for the service or sub-component that you want to modify.

    Using keystone, the Identity service as an example:

    ardana > vi ~/openstack/ardana/ansible/roles/logging-common/vars/keystone-clr.yml

    Consider the opening clause of the file:

    sub_service:
      hosts: KEY-API
      name: keystone
      service: keystone

    The hosts setting defines the role that triggers this logrotate definition being applied to a particular host. It can use regular expressions for pattern matching, for example, NEU-.*.

    The service setting identifies the high-level service name associated with this content, which will be used for determining log files' collective quotas for storage on disk.

  2. Verify logging is enabled by locating the following lines:

    centralized_logging:
      enabled: true
      format: rawjson
    Note

    When possible, centralized logging is most effective on log files generated using logstash-formatted JSON. These files should specify format: rawjson. When only plaintext log files are available, format: json is appropriate. (This causes their plaintext log lines to be wrapped in a JSON envelope before being sent to centralized logging storage.)

  3. Observe log files selected for rotation:

    - files:
      - /var/log/keystone/keystone.log
      log_rotate:
      - daily
      - maxsize 300M
      - rotate 7
      - compress
      - missingok
      - notifempty
      - copytruncate
      - create 640 keystone adm
    Note

    With the introduction of dynamic log rotation, the frequency (that is, daily) and file size threshold (that is, maxsize) settings no longer have any effect. The rotate setting may be easily overridden on a service-by-service basis.

  4. Commit any changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the logging reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
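
Two ideas from this procedure can be illustrated with a short Python sketch: how a hosts pattern such as NEU-.* could be matched against a role name, and what wrapping a plaintext log line in a JSON envelope looks like. This is only an illustration. The function names, the whole-name matching rule, and the envelope field names (service, message) are assumptions, not taken from the centralized logging code.

```python
import json
import re


def hosts_pattern_matches(pattern, role):
    """Return True if the whole role name matches the hosts pattern.

    Assumption: the pattern must match the complete role name.
    """
    return re.fullmatch(pattern, role) is not None


def wrap_plaintext_line(line, service):
    """Wrap one plaintext log line in a JSON envelope.

    The field names "service" and "message" are hypothetical.
    """
    return json.dumps({"service": service, "message": line.rstrip("\n")})


print(hosts_pattern_matches("NEU-.*", "NEU-SVR"))   # True
print(hosts_pattern_matches("KEY-API", "NEU-SVR"))  # False
print(wrap_plaintext_line("INFO keystone started\n", "keystone"))
```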

13.2.6.20 Controlling Disk Space Allocation and Retention of Log Files

Each service is assigned a weighted allocation of the /var/log filesystem's capacity. When all its log files' cumulative sizes exceed this allocation, a rotation is triggered for that service's log files according to the behavior specified in the /etc/logrotate.d/* specification.

These specification files are auto-generated based on YML sources delivered with the Cloud Lifecycle Manager codebase. The source files can be edited and reapplied to control the allocation of disk space across services or the behavior during a rotation.

Disk capacity is allocated as a percentage of the total weighted value of all services running on a particular node. For example, if 20 services run on the same node, all with a default weight of 100, they will each be granted 1/20th of the log filesystem's capacity. If the configuration is updated to change one service's weight to 150, all the services' allocations will be adjusted to make it possible for that one service to consume 150% of the space available to other individual services.
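
The weighted allocation described above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only; the function name and the service names are hypothetical, and the capacity value is chosen to make the numbers easy to read.

```python
def allocate_log_capacity(weights, capacity):
    """Split capacity across services in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: capacity * weight / total_weight
            for name, weight in weights.items()}


# 20 services with the default weight of 100 (names are placeholders).
weights = {"service%02d" % i: 100 for i in range(1, 21)}
# Raise one service's weight to 150.
weights["service01"] = 150

shares = allocate_log_capacity(weights, 20500)
print(shares["service01"])  # 1500.0
print(shares["service02"])  # 1000.0
```

With the total weight now 2050, the service with weight 150 receives 1.5 times the allocation of each of the other 19 services, and every other service receives slightly less than its original 1/20th.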

These policies are enforced by the script /opt/kronos/rotate_if_exceeded_quota.py, which will be executed every 5 minutes via a cron job and will rotate the log files of any services which have exceeded their respective quotas. When log rotation takes place for a service, logs are generated to describe the activity in /var/log/kronos/check_if_exceeded_quota.log.

When logrotate is performed on a service, its existing log files are compressed and archived to make space available for fresh log entries. Once the number of archived log files exceeds that service's retention threshold, the oldest files are deleted. Thus, longer retention thresholds (for example, 10 to 15) will result in more of the service's allocated log capacity being used for historic logs, while shorter retention thresholds (for example, 1 to 5) will keep more space available for its active plaintext log files.
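
The effect of a retention threshold can be sketched as follows. logrotate performs this pruning itself; the Python below only illustrates which archives fall outside a given threshold. The function name and the (name, modification time) tuples are hypothetical.

```python
def archives_to_delete(archives, retention):
    """Return the names of archives beyond the newest `retention` files.

    archives: list of (file_name, modification_time) tuples.
    """
    newest_first = sorted(archives, key=lambda item: item[1], reverse=True)
    return [name for name, _mtime in newest_first[retention:]]


archives = [
    ("keystone.log.1.gz", 400),
    ("keystone.log.2.gz", 300),
    ("keystone.log.3.gz", 200),
    ("keystone.log.4.gz", 100),
]
print(archives_to_delete(archives, retention=2))
# ['keystone.log.3.gz', 'keystone.log.4.gz']
```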

Use the following process to make adjustments to services' log capacity allocations or retention thresholds:

  1. Navigate to the following directory on your Cloud Lifecycle Manager:

    ~/scratch/ansible/next/ardana/ansible
  2. Open and edit the service weights file:

    ardana > vi roles/kronos-logrotation/vars/rotation_config.yml
  3. Edit the service parameters to set the desired parameters. Example:

    cinder:
      weight: 300
      retention: 2
    Note

    A retention setting of default uses the recommended defaults for each service's log files.

  4. Run the kronos-logrotation-deploy playbook:

    ardana > ansible-playbook -i hosts/verb_hosts kronos-logrotation-deploy.yml
  5. Verify that the quotas have been changed:

    Log in to a node and check the contents of the file /opt/kronos/service_info.yml to see the active quotas for that node, and the specifications in /etc/logrotate.d/* for rotation thresholds.

13.2.6.21 Configuring Elasticsearch for Centralized Logging

Elasticsearch includes some tunable options exposed in its configuration. SUSE OpenStack Cloud uses these options in Elasticsearch to prioritize indexing speed over search speed. SUSE OpenStack Cloud also configures Elasticsearch for optimal performance in low RAM environments. The options that SUSE OpenStack Cloud modifies are listed below along with an explanation about why they were modified.

These configurations are defined in the ~/openstack/my_cloud/config/logging/main.yml file and are implemented in the Elasticsearch configuration file ~/openstack/my_cloud/config/logging/elasticsearch.yml.j2.

13.2.6.22 Safeguards for the Log Partitions Disk Capacity

The logging partitions are at high risk of filling up over time, a condition that can cause many negative side effects for running services. It is therefore important to safeguard against log files consuming 100% of the available capacity.

This protection is implemented by pairs of low/high watermark thresholds, with values established in ~/scratch/ansible/next/ardana/ansible/roles/logging-common/defaults/main.yml and applied by the kronos-logrotation-deploy playbook.

  • var_log_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/log partition beyond which alarms will be triggered (visible to administrators in monasca).

  • var_log_high_watermark_percent (default: 95) defines how much capacity of the /var/log partition to make available for log rotation (in calculating weighted service allocations).

  • var_audit_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/audit partition beyond which alarm notifications will be triggered.

  • var_audit_high_watermark_percent (default: 95) sets a capacity level for the contents of the /var/audit partition which will cause log rotation to be forced according to the specification in /etc/auditlogrotate.conf.
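
How the watermark pairs interact can be sketched in Python. The defaults of 80 and 95 come from the list above; the function names and the returned labels are hypothetical, and the sketch is not the code that the playbook deploys.

```python
def rotation_capacity(total_capacity, high_watermark_percent=95):
    """Capacity of /var/log made available for weighted service allocations."""
    return total_capacity * high_watermark_percent / 100


def audit_partition_action(used_percent, low_percent=80, high_percent=95):
    """Classify /var/audit usage against the low and high watermarks."""
    if used_percent >= high_percent:
        return "rotate"   # log rotation is forced
    if used_percent > low_percent:
        return "alarm"    # alarm notifications are triggered
    return "none"


print(rotation_capacity(1000))        # 950.0
print(audit_partition_action(70))     # none
print(audit_partition_action(85))     # alarm
print(audit_partition_action(97))     # rotate
```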

13.2.7 Audit Logging Overview

Existing OpenStack service logging varies widely across services. Generally, log messages do not have enough detail about who is requesting the application program interface (API), or enough context-specific details about an action performed. Often details are not even consistently logged across various services, leading to inconsistent data formats being used across services. These issues make it difficult to integrate logging with existing audit tools and processes.

To help you monitor your workload and data in compliance with your corporate, industry or regional policies, SUSE OpenStack Cloud provides auditing support as a basic security feature. The audit logging can be integrated with customer Security Information and Event Management (SIEM) tools and support your efforts to correlate threat forensics.

The SUSE OpenStack Cloud audit logging feature uses Audit Middleware for Python services. This middleware works with OpenStack services that use the Paste Deploy system. Most OpenStack services use Paste Deploy to find and configure WSGI servers and applications, so relying on it provides auditing support with minimal changes to the services.

Audit logging is a post-installation feature. By default, it is disabled in the cloudConfig file on the Cloud Lifecycle Manager, and it can only be enabled after SUSE OpenStack Cloud installation or upgrade.

The tasks in this section explain how to enable services for audit logging in your environment. SUSE OpenStack Cloud provides audit logging for the following services:

  • nova

  • barbican

  • keystone

  • cinder

  • ceilometer

  • neutron

  • glance

  • heat

For audit log backup information, see Section 17.3.4, “Audit Log Backup and Restore”.

13.2.7.1 Audit Logging Checklist

Before enabling audit logging, make sure you understand how much disk space you will need, and configure the disks that will store the logging data. Use the following table to complete these tasks:

13.2.7.1.1 Frequently Asked Questions
How are audit logs generated?

The audit logs are created by services running in the cloud management controller nodes. The events that create auditing entries are formatted using a structure that is compliant with Cloud Auditing Data Federation (CADF) policies. The formatted audit entries are then saved to disk files. For more information, see the Cloud Auditing Data Federation Website.

Where are audit logs stored?

We strongly recommend adding a dedicated disk volume for /var/audit.

If the disk templates for the controllers are not updated to create a separate volume for /var/audit, the audit logs will still be created in the root partition under the folder /var/audit. This could be problematic if the root partition does not have adequate space to hold the audit logs.

Warning

We recommend that you do not store audit logs in the /var/log volume. The /var/log volume is used for storing operational logs, and log rotation and alarms have been preconfigured for various services based on the size of this volume. Adding audit logs here may affect these settings and cause undesired alarms. It would also affect the retention times for the operational logs.

Are audit logs centrally stored?

Yes. The existing operational log profiles have been configured to centrally log audit logs as well, once their generation has been enabled. The audit logs are stored in Elasticsearch indices separate from those of the operational logs.

How long are audit log files retained?

By default, audit logs are configured to be retained for 7 days on disk. The audit logs are rotated each day and the rotated files are stored in a compressed format and retained up to 7 days (configurable). The backup service has been configured to back up the audit logs to a location outside of the controller nodes for much longer retention periods.

Do I lose audit data if a management controller node goes down?

Yes. For this reason, it is strongly recommended that you back up the audit partition in each of the management controller nodes for protection against any data loss.

13.2.7.1.2 Estimate Disk Size

The table below provides estimates from each service of audit log size generated per day. The estimates are provided for environments with 100 nodes, 300 nodes, and 500 nodes.

Service     Log File Size: 100 nodes                Log File Size: 300 nodes                Log File Size: 500 nodes
barbican    2.6 MB                                  4.2 MB                                  5.6 MB
keystone    96 - 131 MB                             288 - 394 MB                            480 - 657 MB
nova        186 (with a margin of 46) MB            557 (with a margin of 139) MB           928 (with a margin of 232) MB
ceilometer  12 MB                                   12 MB                                   12 MB
cinder      2 - 250 MB                              2 - 250 MB                              2 - 250 MB
neutron     145 MB                                  433 MB                                  722 MB
glance      20 (with a margin of 8) MB              60 (with a margin of 22) MB             100 (with a margin of 36) MB
heat        432 MB (1 transaction per second)       432 MB (1 transaction per second)       432 MB (1 transaction per second)
swift       33 GB (700 transactions per second)     102 GB (2100 transactions per second)   172 GB (3500 transactions per second)
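
As a rough sizing aid, the daily estimates can be summed and multiplied by the retention period. The sketch below uses the 100-node column and makes several assumptions that are not stated in the table: it takes the upper bound of each range, adds each quoted margin to its base value, leaves out swift, and ignores the compression of rotated files, so the result is a conservative upper bound. The names are hypothetical.

```python
# Upper-bound daily audit log size in MB for a 100-node environment.
DAILY_MB_100_NODES = {
    "barbican": 2.6,
    "keystone": 131,       # upper end of 96 - 131 MB
    "nova": 186 + 46,      # base value plus margin (assumed additive)
    "ceilometer": 12,
    "cinder": 250,         # upper end of 2 - 250 MB
    "neutron": 145,
    "glance": 20 + 8,      # base value plus margin (assumed additive)
    "heat": 432,
}


def audit_disk_estimate_mb(daily_mb, retention_days=7):
    """Estimate uncompressed audit log space for the retention period."""
    return sum(daily_mb.values()) * retention_days


daily_total = sum(DAILY_MB_100_NODES.values())
print(round(daily_total, 1))                                   # 1232.6
print(round(audit_disk_estimate_mb(DAILY_MB_100_NODES), 1))    # 8628.2
```

Under these assumptions, a 100-node environment with all listed services enabled needs on the order of 9 GB in /var/audit for the default 7-day retention, before any safety margin.
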
13.2.7.1.3 Add disks to the controller nodes

You need to add disks for the audit log partition to store the data in a secure manner. The steps to complete this task will vary depending on the type of server you are running. Please refer to the manufacturer’s instructions on how to add disks for the type of server node used by the management controller cluster. If you already have extra disks in the controller node, you can identify any unused one and use it for the audit log partition.

13.2.7.1.4 Update the disk template for the controller nodes

Since audit logging is disabled by default, the audit volume groups in the disk templates are commented out. If you want to turn on audit logging, the template needs to be updated first. If it is not updated, there will be no back-up volume group. To update the disk template, you will need to copy templates from the examples folder to the definition folder and then edit the disk controller settings. Changes to the disk template used for provisioning cloud nodes must be made prior to deploying the nodes.

To update the disk controller template:

  1. Log in to your Cloud Lifecycle Manager.

  2. To copy the example templates folder, run the following command:

    Important

    If you already have the required templates in the definition folder, you can skip this step.

    ardana > cp -r ~/openstack/examples/entry-scale-esx/* ~/openstack/my_cloud/definition/
  3. To change to the data folder, run:

    ardana > cd ~/openstack/my_cloud/definition/
  4. To edit the disks controller settings, open the file that matches your server model and disk model in a text editor:

    Model              File
    entry-scale-kvm    disks_controller_1TB.yml
                       disks_controller_600GB.yml
    mid-scale          disks_compute.yml
                       disks_control_common_600GB.yml
                       disks_dbmq_600GB.yml
                       disks_mtrmon_2TB.yml
                       disks_mtrmon_4.5TB.yml
                       disks_mtrmon_600GB.yml
                       disks_swobj.yml
                       disks_swpac.yml
  5. To update the settings and enable an audit log volume group, edit the appropriate file(s) listed above and remove the '#' comments from these lines, confirming that they are appropriate for your environment.

    - name: audit-vg
      physical-volumes:
        - /dev/sdz
      logical-volumes:
        - name: audit
          size: 95%
          mount: /var/audit
          fstype: ext4
          mkfs-opts: -O large_file
13.2.7.1.5 Save your changes

To save your changes, add the updated disk setup files to the Git repository.

To save your changes:

  1. To change to the openstack directory, run:

    ardana > cd ~/openstack
  2. To add the new and updated files, run:

    ardana > git add -A
  3. To verify the files are added, run:

    ardana > git status
  4. To commit your changes, run:

    ardana > git commit -m "Setup disks for audit logging"

13.2.7.2 Enable Audit Logging

To enable audit logging you must edit your cloud configuration settings, save your changes and re-run the configuration processor. Then you can run the playbooks to create the volume groups and configure them.

In the ~/openstack/my_cloud/definition/cloudConfig.yml file, service names defined under enabled-services or disabled-services override the default setting.

The following is an example of your audit-settings section:

# Disc space needs to be allocated to the audit directory before enabling
# auditing.
# Default can be either "disabled" or "enabled". Services listed in
# "enabled-services" and "disabled-services" override the default setting.
audit-settings:
   default: disabled
   #enabled-services:
   #  - keystone
   #  - barbican
   disabled-services:
     - nova
     - barbican
     - keystone
     - cinder
     - ceilometer
     - neutron

In this example, the default setting for all services is disabled. However, keystone and barbican can be explicitly enabled by removing the comment markers from the enabled-services lines, because that setting overrides the default.
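
The override rule can be expressed as a small Python function. This is a sketch of the rule as described above, not the configuration processor's code. The function name is hypothetical, and treating a service listed in both sections as an error is an assumption made here to keep the result unambiguous.

```python
def effective_audit_settings(audit_settings, services):
    """Return {service: True/False} after applying the override lists."""
    default_enabled = audit_settings.get("default", "disabled") == "enabled"
    enabled = set(audit_settings.get("enabled-services") or [])
    disabled = set(audit_settings.get("disabled-services") or [])

    conflicts = enabled & disabled
    if conflicts:
        # Assumption: a service must not appear in both lists.
        raise ValueError("listed as both enabled and disabled: %s"
                         % sorted(conflicts))

    result = {}
    for service in services:
        if service in enabled:
            result[service] = True
        elif service in disabled:
            result[service] = False
        else:
            result[service] = default_enabled
    return result


settings = {
    "default": "disabled",
    "enabled-services": ["keystone", "barbican"],
    "disabled-services": ["nova", "cinder", "ceilometer", "neutron"],
}
print(effective_audit_settings(settings, ["keystone", "nova", "glance"]))
# {'keystone': True, 'nova': False, 'glance': False}
```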

13.2.7.2.1 To edit the configuration file:
  1. Log in to your Cloud Lifecycle Manager.

  2. To change to the cloud definition folder, run:

    ardana > cd ~/openstack/my_cloud/definition
  3. To edit the auditing settings, in a text editor, open the following file:

    cloudConfig.yml
  4. To enable audit logging, begin by uncommenting the following lines in the "enabled-services:" block:

    • enabled-services:

    • any service you want to enable for audit logging.

    For example, keystone has been enabled in the following text:

    Default cloudConfig.yml file:

    audit-settings:
      default: disabled
      enabled-services:
      #  - keystone

    Enabling keystone audit logging:

    audit-settings:
      default: disabled
      enabled-services:
        - keystone
  5. To move the services you want to enable, comment out the service in the disabled section and add it to the enabled section. For example, barbican has been enabled in the following text:

    cloudConfig.yml file:

    audit-settings:
      default: disabled
      enabled-services:
        - keystone
      disabled-services:
        - nova
        # - keystone
        - barbican
        - cinder

    Enabling barbican audit logging:

    audit-settings:
      default: disabled
      enabled-services:
        - keystone
        - barbican
      disabled-services:
        - nova
        # - barbican
        # - keystone
        - cinder
13.2.7.2.2 To save your changes and run the configuration processor:
  1. To change to the openstack directory, run:

    ardana > cd ~/openstack
  2. To add the new and updated files, run:

    ardana > git add -A
  3. To verify the files are added, run:

    ardana > git status
  4. To commit your changes, run:

    ardana > git commit -m "Enable audit logging"
  5. To change to the directory with the ansible playbooks, run:

    ardana > cd ~/openstack/ardana/ansible
  6. To rerun the configuration processor, run:

    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
13.2.7.2.3 To create the volume group:
  1. To change to the directory containing the osconfig playbook, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To remove the stub file that osconfig uses to decide if the disks are already configured, run:

    ardana > ansible -i hosts/verb_hosts KEY-API -a 'sudo rm -f /etc/openstack/osconfig-ran'
    Important

    The osconfig playbook uses the stub file to record that disks have already been configured, which keeps the playbook idempotent. To stop osconfig from treating your new disk as already configured, you must remove the stub file /etc/openstack/osconfig-ran before re-running the osconfig playbook.

  3. To run the playbook that enables auditing for a service, run:

    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API
    Important

    The variable KEY-API is used as an example to cover the management controller cluster. To enable auditing for a service that is not run on the same cluster, add the service to the --limit flag in the above command. For example:

    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API:NEU-SVR
13.2.7.2.4 To reconfigure services for audit logging:
  1. To change to the directory containing the service playbooks, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To run the playbook that reconfigures a service for audit logging, run:

    ardana > ansible-playbook -i hosts/verb_hosts SERVICE_NAME-reconfigure.yml

    For example, to reconfigure keystone for audit logging, run:

    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
  3. Repeat steps 1 and 2 for each service you need to reconfigure.

    Important

    You must reconfigure each service that you changed to be enabled or disabled in the cloudConfig.yml file.
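
When several services change state at once, it can help to list the reconfigure commands up front. The following Python sketch compares the audit state before and after an edit and prints one command per changed service, using the SERVICE_NAME-reconfigure.yml naming shown above. The function names and the dictionaries are hypothetical, and the script only prints the commands; it does not run them.

```python
def changed_services(before, after):
    """Return the sorted names of services whose audit state differs."""
    names = set(before) | set(after)
    return sorted(name for name in names
                  if before.get(name, False) != after.get(name, False))


def reconfigure_commands(services):
    """Build one reconfigure command line per service."""
    return ["ansible-playbook -i hosts/verb_hosts %s-reconfigure.yml" % name
            for name in services]


before = {"keystone": False, "barbican": False, "nova": False}
after = {"keystone": True, "barbican": True, "nova": False}

for command in reconfigure_commands(changed_services(before, after)):
    print(command)
```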

13.2.8 Troubleshooting

For information on troubleshooting Centralized Logging, see Section 18.7.1, “Troubleshooting Centralized Logging”.

13.3 Metering Service (ceilometer) Overview

The SUSE OpenStack Cloud metering service collects and provides access to OpenStack usage data that can be used for billing reporting such as showback and chargeback. The metering service can also provide general usage reporting. ceilometer acts as the central collection and data access service to the meters provided by all the OpenStack services. The data collected is available through the monasca API. ceilometer V2 API was deprecated in the Pike release upstream.

13.3.1 Metering Service New Functionality

13.3.1.1 New Metering Functionality in SUSE OpenStack Cloud 9

  • ceilometer is now integrated with monasca, using it as the datastore.

  • The default meters and other items configured for ceilometer can now be modified and additional meters can be added. We recommend that users test overall SUSE OpenStack Cloud performance prior to deploying any ceilometer modifications to ensure the addition of new notifications or polling events does not negatively affect overall system performance.

  • ceilometer Central Agent (pollster) is now called Polling Agent and is configured to support HA (Active-Active).

  • Notification Agent has built-in HA (Active-Active) with support for pipeline transformers, but workload partitioning has been disabled in SUSE OpenStack Cloud.

  • swift poll-based account-level meters are enabled by default with an hourly collection cycle.

  • Integration with centralized monitoring (monasca) and centralized logging

  • Support for upgrade and reconfigure operations

13.3.1.2 Limitations

  • The number of metadata attributes that can be extracted from resource_metadata is limited to 16. This limit applies to the fields in the metadata section of the monasca_field_definitions.yaml file for any service: the fields in metadata.common plus the fields in metadata.<service.meters> cannot total more than 16.

  • Several network-related attributes are accessible using a colon ":" but are returned as a period ".". For example, you can access a sample list using the following command:

    ardana > source ~/service.osrc
    ardana > ceilometer --debug sample-list network -q "resource_id=421d50a5-156e-4cb9-b404-d2ce5f32f18b;resource_metadata.provider.network_type=flat"

    However, in response you will see the following:

    provider.network_type

    instead of

    provider:network_type

    This limitation is known for the following attributes:

    provider:network_type
    provider:physical_network
    provider:segmentation_id
  • ceilometer Expirer is not supported. Data retention expiration is handled by monasca with a default retention period of 45 days.

  • ceilometer Collector is not supported.
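
The first two limitations can be checked with a few lines of Python before editing the field definitions. This sketch assumes the common and per-service metadata fields are available as plain lists; the function names and the sample field lists are hypothetical and do not reflect the actual layout of monasca_field_definitions.yaml.

```python
MAX_METADATA_FIELDS = 16


def within_metadata_limit(common_fields, service_fields,
                          limit=MAX_METADATA_FIELDS):
    """Check that common plus service-specific fields stay within the limit."""
    return len(common_fields) + len(service_fields) <= limit


def returned_attribute_name(attribute):
    """Show how a colon-separated attribute name appears in responses."""
    return attribute.replace(":", ".")


common = ["user_id", "project_id", "region"]        # hypothetical fields
network = ["provider:network_type", "provider:physical_network",
           "provider:segmentation_id"]

print(within_metadata_limit(common, network))              # True
print(returned_attribute_name("provider:network_type"))    # provider.network_type
```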

13.3.2 Understanding the Metering Service Concepts

13.3.2.1 Ceilometer Introduction

Before configuring the ceilometer Metering Service, it is important to understand how it works.

13.3.2.1.1 Metering Architecture

SUSE OpenStack Cloud automatically configures ceilometer to use Logging and Monitoring Service (monasca) as its backend. ceilometer is deployed on the same control plane nodes as monasca.

The installation of ceilometer creates several management nodes running different metering components.

ceilometer Components on Controller nodes

This controller node is the first of the High Availability (HA) cluster.

ceilometer Sample Polling

Sample Polling is part of the Polling Agent. Messages are posted by the Notification Agent directly to monasca API.

ceilometer Polling Agent

The Polling Agent is responsible for coordinating the polling activity. It parses the pipeline.yml configuration file and identifies all the sources that need to be polled. The sources are then evaluated using a discovery mechanism and all the sources are translated to resources where a dedicated pollster can retrieve and publish data. At each identified interval the discovery mechanism is triggered, the resource list is composed, and the data is polled and sent to the queue.

ceilometer Collector No Longer Required

In previous versions, the collector was responsible for getting the samples/events from the RabbitMQ service and storing them in the main database. The ceilometer Collector is no longer enabled. Now that the Notification Agent posts the data directly to the monasca API, the collector is no longer required.

13.3.2.1.2 Meter Reference

ceilometer collects basic information grouped into categories known as meters. A meter is the unique resource-usage measurement of a particular OpenStack service. Each OpenStack service defines what type of data is exposed for metering.

Each meter has the following characteristics:

Attribute             Description
Name                  Description of the meter
Unit of Measurement   The method by which the data is measured. For example: storage meters are defined in Gigabytes (GB) and network bandwidth is measured in Gigabits (Gb).
Type                  The origin of the meter's data. OpenStack defines the following origins:

                        • Cumulative - Increasing over time (instance hours)

                        • Gauge - a discrete value. For example: the number of floating IP addresses or image uploads.

                        • Delta - Changing over time (bandwidth)

A meter is defined for every measurable resource. A meter can exist beyond the actual existence of a particular resource, such as an active instance, to provision long-cycle use cases such as billing.

Important

For a list of meter types and default meters installed with SUSE OpenStack Cloud, see Section 13.3.3, “Ceilometer Metering Available Meter Types”.

The most common meter submission method is notifications. With this method, each service sends the data from their respective meters on a periodic basis to a common notifications bus.

ceilometer, in turn, pulls all of the events from the bus and saves the notifications in a ceilometer-specific database. The period of time that the data is collected and saved is known as the ceilometer expiry and is configured during ceilometer installation. Each meter is collected from one or more samples, gathered from the messaging queue or polled by agents. The samples are represented by counter objects. Each counter has the following fields:

Attribute             Description
counter_name          Description of the counter
counter_unit          The method by which the data is measured. For example: data can be defined in Gigabytes (GB) or for network bandwidth, measured in Gigabits (Gb).
counter_type          The origin of the counter's data. OpenStack defines the following origins:

                        • Cumulative - Increasing over time (instance hours)

                        • Gauge - a discrete value. For example: the number of floating IP addresses or image uploads.

                        • Delta - Changing over time (bandwidth)

counter_volume        The volume of data measured (CPU ticks, bytes transmitted, etc.). Not used for gauge counters. Set to a default value such as 1.
resource_id           The identifier of the resource measured (UUID)
project_id            The project (tenant) ID to which the resource belongs.
user_id               The ID of the user who owns the resource.
resource_metadata     Other data transmitted in the metering notification payload.
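
The counter fields above map naturally onto a simple record type. The following Python dataclass is a sketch for illustration only: the class name Counter and the validation are hypothetical, it is not ceilometer's own sample class, and the identifiers in the example are placeholders.

```python
from dataclasses import dataclass, field

VALID_COUNTER_TYPES = ("cumulative", "gauge", "delta")


@dataclass
class Counter:
    """Illustrative container for the counter fields listed above."""
    counter_name: str
    counter_unit: str
    counter_type: str
    counter_volume: float
    resource_id: str
    project_id: str
    user_id: str
    resource_metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything other than the three documented origins.
        if self.counter_type not in VALID_COUNTER_TYPES:
            raise ValueError("unknown counter_type: %r" % self.counter_type)


sample = Counter(
    counter_name="vcpus",
    counter_unit="vcpu",
    counter_type="gauge",
    counter_volume=2,
    resource_id="instance-uuid-placeholder",
    project_id="project-id-placeholder",
    user_id="user-id-placeholder",
)
print(sample.counter_name, sample.counter_volume)   # vcpus 2
```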

13.3.3 Ceilometer Metering Available Meter Types

The Metering service contains three types of meters:

Cumulative

A cumulative meter measures data over time (for example, instance hours).

Gauge

A gauge measures discrete items (for example, floating IPs or image uploads) or fluctuating values (such as disk input or output).

Delta

A delta measures change over time, for example, monitoring bandwidth.
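
The relationship between cumulative and delta values can be shown with a short Python sketch. It assumes a series of readings from a cumulative meter taken at successive polling intervals; the function name and the sample readings are hypothetical.

```python
def cumulative_to_deltas(readings):
    """Convert successive cumulative readings into per-interval deltas."""
    return [later - earlier
            for earlier, later in zip(readings, readings[1:])]


# Four readings of a cumulative meter (for example, bytes transferred).
readings = [10, 12, 15, 15]
print(cumulative_to_deltas(readings))   # [2, 3, 0]
```

A gauge, by contrast, is reported as-is at each interval and needs no such conversion.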

Each meter is populated from one or more samples, which are gathered from the messaging queue (listening agent), polling agents, or push agents. Samples are populated by counter objects.

Each counter contains the following fields:

name

the name of the meter

type

the type of meter (cumulative, gauge, or delta)

amount

the amount of data measured

unit

the unit of measure

resource

the resource being measured

project ID

the project the resource is assigned to

user

the user the resource is assigned to.

Note: The metering service shares the same High-availability proxy, messaging, and database clusters with the other Information services. To avoid unnecessarily high loads, see Section 13.3.8, “Optimizing the Ceilometer Metering Service”.

13.3.3.1 SUSE OpenStack Cloud Default Meters

These meters are installed and enabled by default during a SUSE OpenStack Cloud installation. More information about ceilometer can be found at OpenStack ceilometer.

13.3.3.2 Compute (nova) Meters

Meter                             Type        Unit       Resource      Origin        Note
vcpus                             Gauge       vcpu       Instance ID   Notification  Number of virtual CPUs allocated to the instance
memory                            Gauge       MB         Instance ID   Notification  Volume of RAM allocated to the instance
memory.resident                   Gauge       MB         Instance ID   Pollster      Volume of RAM used by the instance on the physical machine
memory.usage                      Gauge       MB         Instance ID   Pollster      Volume of RAM used by the instance from the amount of its allocated memory
cpu                               Cumulative  ns         Instance ID   Pollster      CPU time used
cpu_util                          Gauge       %          Instance ID   Pollster      Average CPU utilization
disk.read.requests                Cumulative  request    Instance ID   Pollster      Number of read requests
disk.read.requests.rate           Gauge       request/s  Instance ID   Pollster      Average rate of read requests
disk.write.requests               Cumulative  request    Instance ID   Pollster      Number of write requests
disk.write.requests.rate          Gauge       request/s  Instance ID   Pollster      Average rate of write requests
disk.read.bytes                   Cumulative  B          Instance ID   Pollster      Volume of reads
disk.read.bytes.rate              Gauge       B/s        Instance ID   Pollster      Average rate of reads
disk.write.bytes                  Cumulative  B          Instance ID   Pollster      Volume of writes
disk.write.bytes.rate             Gauge       B/s        Instance ID   Pollster      Average rate of writes
disk.root.size                    Gauge       GB         Instance ID   Notification  Size of root disk
disk.ephemeral.size               Gauge       GB         Instance ID   Notification  Size of ephemeral disk
disk.device.read.requests         Cumulative  request    Disk ID       Pollster      Number of read requests
disk.device.read.requests.rate    Gauge       request/s  Disk ID       Pollster      Average rate of read requests
disk.device.write.requests        Cumulative  request    Disk ID       Pollster      Number of write requests
disk.device.write.requests.rate   Gauge       request/s  Disk ID       Pollster      Average rate of write requests
disk.device.read.bytes            Cumulative  B          Disk ID       Pollster      Volume of reads
disk.device.read.bytes.rate       Gauge       B/s        Disk ID       Pollster      Average rate of reads
disk.device.write.bytes           Cumulative  B          Disk ID       Pollster      Volume of writes
disk.device.write.bytes.rate      Gauge       B/s        Disk ID       Pollster      Average rate of writes
disk.capacity                     Gauge       B          Instance ID   Pollster      The amount of disk that the instance can see
disk.allocation                   Gauge       B          Instance ID   Pollster      The amount of disk occupied by the instance on the host machine
disk.usage                        Gauge       B          Instance ID   Pollster      The physical size in bytes of the image container on the host
disk.device.capacity              Gauge       B          Disk ID       Pollster      The amount of disk per device that the instance can see
disk.device.allocation            Gauge       B          Disk ID       Pollster      The amount of disk per device occupied by the instance on the host machine
disk.device.usage                 Gauge       B          Disk ID       Pollster      The physical size in bytes of the image container on the host per device
network.incoming.bytes            Cumulative  B          Interface ID  Pollster      Number of incoming bytes
network.outgoing.bytes            Cumulative  B          Interface ID  Pollster      Number of outgoing bytes
network.incoming.packets          Cumulative  packet     Interface ID  Pollster      Number of incoming packets
network.outgoing.packets          Cumulative  packet     Interface ID  Pollster      Number of outgoing packets
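
The table lists cpu as a cumulative meter in nanoseconds and cpu_util as a percentage gauge. One common way to relate the two is to divide the CPU time consumed between two polls by the CPU time that was available in that interval. The Python sketch below shows that arithmetic under this assumption; the table does not state the exact calculation used, and the function name is hypothetical.

```python
def cpu_util_percent(cpu_ns_start, cpu_ns_end, elapsed_seconds, vcpus):
    """Average CPU utilization between two cumulative cpu readings."""
    used_ns = cpu_ns_end - cpu_ns_start
    available_ns = elapsed_seconds * 1_000_000_000 * vcpus
    return 100.0 * used_ns / available_ns


# An instance with 2 vCPUs consumed 30 seconds of CPU time in 60 seconds.
print(cpu_util_percent(0, 30_000_000_000, 60, 2))   # 25.0
```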

13.3.3.3 Compute Host Meters

Meter                             Type        Unit       Resource      Origin        Note
compute.node.cpu.frequency        Gauge       MHz        Host ID       Notification  CPU frequency
compute.node.cpu.kernel.time      Cumulative  ns         Host ID       Notification  CPU kernel time
compute.node.cpu.idle.time        Cumulative  ns         Host ID       Notification  CPU idle time
compute.node.cpu.user.time        Cumulative  ns         Host ID       Notification  CPU user mode time
compute.node.cpu.iowait.time      Cumulative  ns         Host ID       Notification  CPU I/O wait time
compute.node.cpu.kernel.percent   Gauge       %          Host ID       Notification  CPU kernel percentage
compute.node.cpu.idle.percent     Gauge       %          Host ID       Notification  CPU idle percentage
compute.node.cpu.user.percent     Gauge       %          Host ID       Notification  CPU user mode percentage
compute.node.cpu.iowait.percent   Gauge       %          Host ID       Notification  CPU I/O wait percentage
compute.node.cpu.percent          Gauge       %          Host ID       Notification  CPU utilization

13.3.3.4 Image (glance) Meters

Meter           Type    Unit    Resource   Origin        Note
image.size      Gauge   B       Image ID   Notification  Uploaded image size
image.update    Delta   Image   Image ID   Notification  Number of updates on the image
image.upload    Delta   Image   Image ID   Notification  Number of uploads of the image
image.delete    Delta   Image   Image ID   Notification  Number of deletes on the image

13.3.3.5 Volume (cinder) Meters

Meter           Type    Unit    Resource   Origin        Note
volume.size     Gauge   GB      Vol ID     Notification  Size of volume
snapshot.size   Gauge   GB      Snap ID    Notification  Size of snapshot's volume

13.3.3.6 Storage (swift) Meters

Meter                        Type    Unit       Resource     Origin    Note
storage.objects              Gauge   Object     Storage ID   Pollster  Number of objects
storage.objects.size         Gauge   B          Storage ID   Pollster  Total size of stored objects
storage.objects.containers   Gauge   Container  Storage ID   Pollster  Number of containers

The resource_id for any ceilometer query is the tenant_id for the swift object because swift usage is rolled up at the tenant level.

13.3.4 Configure the Ceilometer Metering Service

SUSE OpenStack Cloud 9 automatically deploys ceilometer to use the monasca database. ceilometer is deployed on the same control plane nodes along with other OpenStack services such as keystone, nova, neutron, glance, and swift.

The Metering Service can be configured using one of the procedures described below.

13.3.4.1 Run the Upgrade Playbook

Follow the standard service upgrade mechanism available in the Cloud Lifecycle Manager distribution. For ceilometer, the playbook included with SUSE OpenStack Cloud is ceilometer-upgrade.yml.

13.3.4.2 Enable Services for Messaging Notifications

After installation of SUSE OpenStack Cloud, the following services are enabled by default to send notifications:

  • nova

  • cinder

  • glance

  • neutron

  • swift

The list of meters for these services is specified in the pipeline configuration file of the Notification Agent or the Polling Agent.

For steps on how to edit the pipeline configuration files, see: Section 13.3.5, “Ceilometer Metering Service Notifications”

13.3.4.3 Restart the Polling Agent

The Polling Agent is responsible for coordinating the polling activity. It parses the pipeline.yml configuration file and identifies all the sources where data is collected. The sources are then evaluated and are translated to resources that a dedicated pollster can retrieve. The Polling Agent follows this process:

  1. At each identified interval, the pipeline.yml configuration file is parsed.

  2. The resource list is composed.

  3. The pollster collects the data.

  4. The pollster sends data to the queue.
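The four steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not ceilometer's implementation: the pipeline is a hand-written dictionary standing in for the parsed configuration file, and `discover_resources`, `poll`, and `run_polling_cycle` are hypothetical helpers that return made-up values.

```python
from collections import deque

# Hypothetical stand-in for the parsed pipeline file: one source, polled hourly.
PIPELINE = {
    "sources": [
        {
            "name": "swift_source",
            "interval": 3600,
            "meters": ["storage.objects", "storage.objects.size"],
            "sinks": ["meter_sink"],
        }
    ],
}


def discover_resources(source):
    """Step 2: compose the resource list (fixed, made-up tenant IDs here)."""
    return ["tenant-a", "tenant-b"]


def poll(meter, resource):
    """Step 3: a pollster collects one sample (the volume is a dummy value)."""
    return {"meter": meter, "resource_id": resource, "volume": 0}


def run_polling_cycle(pipeline, queue):
    """Run one cycle over every source and push the samples to the queue."""
    for source in pipeline["sources"]:               # step 1: read the sources
        for resource in discover_resources(source):
            for meter in source["meters"]:
                queue.append(poll(meter, resource))  # step 4: send to the queue
    return queue


samples = run_polling_cycle(PIPELINE, deque())
print(len(samples))  # 2 tenants x 2 meters = 4 samples
```

In a real deployment the cycle would repeat at the interval configured for each source.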

Metering processes should normally be operating at all times. This need is addressed by the Upstart event engine, which is designed to run on any Linux system. Upstart creates events, handles the consequences of those events, and starts and stops processes as required. Upstart will continually attempt to restart stopped processes, even if the process was stopped manually. To stop or start the Polling Agent without conflicting with Upstart, use the following steps.

To restart the Polling Agent:

  1. To determine whether the process is running, run:

    tux > sudo systemctl status ceilometer-agent-notification
    #SAMPLE OUTPUT:
    ceilometer-agent-notification.service - ceilometer-agent-notification Service
       Loaded: loaded (/etc/systemd/system/ceilometer-agent-notification.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2018-06-12 05:07:14 UTC; 2 days ago
     Main PID: 31529 (ceilometer-agen)
        Tasks: 69
       CGroup: /system.slice/ceilometer-agent-notification.service
               ├─31529 ceilometer-agent-notification: master process [/opt/stack/service/ceilometer-agent-notification/venv/bin/ceilometer-agent-notification --config-file /opt/stack/service/ceilometer-agent-noti...
               └─31621 ceilometer-agent-notification: NotificationService worker(0)
    
    Jun 12 05:07:14 ardana-qe201-cp1-c1-m2-mgmt systemd[1]: Started ceilometer-agent-notification Service.
  2. To stop the process, run:

    tux > sudo systemctl stop ceilometer-agent-notification
  3. To start the process, run:

    tux > sudo systemctl start ceilometer-agent-notification

13.3.4.4 Replace a Logging, Monitoring, and Metering Controller

In a medium-scale environment, if a metering controller has to be replaced or rebuilt, use the following steps:

  1. Follow the steps in Section 15.1.2.1, “Replacing a Controller Node”.

  2. If the ceilometer nodes are not on the shared control plane, you must reconfigure ceilometer to implement the changes and replace the controller. To do this, run the ceilometer-reconfigure.yml Ansible playbook without the limit option.

13.3.4.5 Configure Monitoring

The monasca HTTP Process monitor checks ceilometer's notification and polling agents. If these agents are down, monasca monitoring alarms are triggered. You can use the notification alarms to debug the issue and restart the notification agent. However, for the Central-Agent (polling) and the Collector, the alarms need to be deleted. These two processes are not started after an upgrade, so when the monitoring process checks the alarms for these components, they will be in the UNDETERMINED state. SUSE OpenStack Cloud no longer monitors these processes. To resolve this issue, manually delete alarms that are installed but no longer used.

To resolve notification alarms, first check the ceilometer-agent-notification logs for errors in the /var/log/ceilometer directory. You can also use the Operations Console to access Kibana and check the logs. This will help you understand and debug the error.

To restart the service, run the ceilometer-start.yml playbook. This playbook starts only the ceilometer processes that have stopped; running processes are restarted only during install, upgrade, or reconfigure. That is the behavior needed in this case. Restarting the stopped process resolves the alarm, because this monasca alarm means that ceilometer-agent-notification is no longer running on certain nodes.

You can access ceilometer data through monasca. ceilometer publishes samples to monasca with credentials of the following accounts:

  • ceilometer user

  • services

Data collected by ceilometer can also be retrieved by the monasca REST API. Make sure you use the following guidelines when requesting data from the monasca REST API:

  • Verify you have the monasca-admin role. This role is configured in the monasca-api configuration file.

  • Specify the tenant id of the services project.

For more details, read the monasca API Specification.
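As a rough sketch of the guidelines above, the following Python builds (but does not send) a measurements request for data that ceilometer published under the services project. Several things here are assumptions and should be checked against the monasca API Specification for your release: the `/v2.0/metrics/measurements` path, the `tenant_id`, `name`, and `start_time` query parameters, and the `X-Auth-Token` header. The host name, port, token, and the function name `build_measurements_request` are placeholders.

```python
from urllib.parse import urlencode


def build_measurements_request(api_base, token, services_tenant_id, meter, start_time):
    """Return (url, headers) for a cross-tenant measurements query.

    The caller's token must belong to a user with the monasca-admin role,
    and services_tenant_id must be the ID of the services project.
    """
    query = urlencode({
        "name": meter,
        "start_time": start_time,
        "tenant_id": services_tenant_id,
    })
    url = f"{api_base.rstrip('/')}/v2.0/metrics/measurements?{query}"
    headers = {"X-Auth-Token": token, "Accept": "application/json"}
    return url, headers


# Placeholder endpoint and token; the tenant ID is the services project
# from the sample project list later in this chapter.
url, headers = build_measurements_request(
    "http://monasca.example.com:8070",
    "PLACEHOLDER_TOKEN",
    "9550570f8bf147b3b9451a635a1024a1",
    "image.size",
    "2018-06-12T00:00:00Z",
)
print(url)
```

The returned URL and headers can be passed to any HTTP client.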

To run monasca commands at the command line, you must have the admin role. This allows you to use the ceilometer account credentials to replace the default admin account credentials defined in the service.osrc file. When you use the ceilometer account credentials, monasca commands will only return data collected by ceilometer. At this time, the monasca command-line interface (CLI) does not support retrieving data of other tenants or projects.

13.3.5 Ceilometer Metering Service Notifications

ceilometer uses the notification agent to listen to the message queue, convert notifications to Events and Samples, and apply pipeline actions.

13.3.5.1 Manage Whitelisting and Polling

SUSE OpenStack Cloud is designed to reduce the amount of data that is stored. SUSE OpenStack Cloud's use of a SQL-based cluster, which is not recommended for big data, means you must control the data that ceilometer collects. You can do this by filtering (whitelisting) the data or by using the configuration files for the ceilometer Polling Agent and the ceilometer Notification Agent.

Whitelisting is used in a rule specification as a positive filtering parameter. Whitelist is only included in rules that can be used in direct mappings, for identity service issues such as service discovery, provisioning users, groups, roles, projects, domains as well as user authentication and authorization.

You can run tests against specific scenarios to see if filtering reduces the amount of data stored. You can create a test by editing or creating a run filter file (whitelist). For steps on how to do this, see: Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 38 “Post Installation Tasks”, Section 38.1 “API Verification”.

ceilometer Polling Agent (polling agent) and ceilometer Notification Agent (notification agent) use different pipeline.yaml files to configure meters that are collected. This prevents accidentally polling for meters which can be retrieved by the polling agent as well as the notification agent. For example, glance image and image.size are meters which can be retrieved both by polling and notifications.

Each of the two configuration files has an interval setting. The interval attribute determines how often, in seconds, data is collected. You can use this setting to control the amount of resources that are used for notifications and for polling. For example, suppose you want to use more resources for notifications and fewer for polling. To accomplish this, set the interval in the polling configuration file to a large value, such as 604800 seconds, which polls only once a week. Then, in the notifications configuration file, set the interval to a much shorter value, such as 30 seconds.
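To make the effect of the two example values concrete, this short Python sketch converts an interval into the number of collection cycles per week. The function name `cycles_per_week` is only illustrative.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def cycles_per_week(interval_seconds):
    """Number of whole collection cycles that fit into one week."""
    return SECONDS_PER_WEEK // interval_seconds


print(cycles_per_week(604800))  # weekly polling interval -> 1
print(cycles_per_week(30))      # 30-second interval -> 20160
```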

Important

swift account data will be collected using the polling mechanism in an hourly interval.

Setting this interval to manage both notifications and polling is the recommended procedure when using a SQL cluster back-end.

Sample ceilometer Polling Agent file:

#File: ~/opt/stack/service/ceilometer-polling/etc/pipeline-polling.yaml
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
         - notifier://

Sample ceilometer Notification Agent (notification agent) file:

#File:    ~/opt/stack/service/ceilometer-agent-notification/etc/pipeline-agent-notification.yaml
---
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
         - notifier://

Both of the pipeline files have two major sections:

Sources

represents the data that is collected either from notifications posted by services or through polling. In the Sources section there is a list of meters. These meters define what kind of data is collected. For a full list refer to the ceilometer documentation available at: Telemetry Measurements

Sinks

represents how the data is modified before it is published to the internal queue for collection and storage.

You will only need to change a setting in the Sources section to control the data collection interval.

For more information, see Telemetry Measurements
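Because every source names the sinks it feeds, a quick consistency check can catch a mistyped sink name before an agent is restarted. The following is a minimal sketch, assuming the pipeline file has already been loaded into a Python dictionary with the same keys as the samples above; `undefined_sinks` is a hypothetical helper, not part of ceilometer.

```python
def undefined_sinks(pipeline):
    """Return (source name, sink name) pairs that reference a missing sink."""
    defined = {sink["name"] for sink in pipeline.get("sinks") or []}
    missing = []
    for source in pipeline.get("sources") or []:
        for sink_name in source.get("sinks") or []:
            if sink_name not in defined:
                missing.append((source["name"], sink_name))
    return missing


# Dictionary form of the polling sample shown earlier (shortened meter list).
pipeline = {
    "sources": [
        {"name": "swift_source", "interval": 3600,
         "meters": ["storage.objects"], "sinks": ["meter_sink"]},
    ],
    "sinks": [
        {"name": "meter_sink", "transformers": None,
         "publishers": ["notifier://"]},
    ],
}

print(undefined_sinks(pipeline))  # [] because every referenced sink exists
```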

To change the ceilometer Polling Agent interval setting:

  1. To find the polling agent configuration file, run:

    cd ~/opt/stack/service/ceilometer-polling/etc
  2. In a text editor, open the following file:

    pipeline-polling.yaml
  3. In the following section, change the value of interval to the desired amount of time:

    ---
    sources:
        - name: swift_source
          interval: 3600
          meters:
              - "storage.objects"
              - "storage.objects.size"
              - "storage.objects.containers"
          resources:
          discovery:
          sinks:
              - meter_sink
    sinks:
        - name: meter_sink
          transformers:
          publishers:
             - notifier://

    In the sample code above, the polling agent will collect data every 3600 seconds, or 1 hour.

To change the ceilometer Notification Agent (notification agent) interval setting:

  1. To find the notification agent configuration file, run:

    cd /opt/stack/service/ceilometer-agent-notification
  2. In a text editor, open the following file:

    pipeline-agent-notification.yaml
  3. In the following section, change the value of interval to the desired amount of time:

    sources:
        - name: meter_source
          interval: 30
          meters:
              - "instance"
              - "image"
              - "image.size"
              - "image.upload"
              - "image.delete"
              - "volume"
              - "volume.size"
              - "snapshot"
              - "snapshot.size"
              - "ip.floating"
              - "network"
              - "network.create"
              - "network.update"

    In the sample code above, the notification agent will collect data every 30 seconds.
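If the interval has to be changed on several nodes, it may help to script the edit. The sketch below only shows the in-memory part, assuming the file has been loaded into a dictionary shaped like the samples above; reading and writing the YAML file is left out, and `with_interval` is a hypothetical helper.

```python
import copy


def with_interval(pipeline, source_name, seconds):
    """Return a copy of the pipeline with one source's interval replaced."""
    if seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    updated = copy.deepcopy(pipeline)
    for source in updated["sources"]:
        if source["name"] == source_name:
            source["interval"] = seconds
            return updated
    raise KeyError(f"no source named {source_name!r}")


original = {"sources": [{"name": "meter_source", "interval": 30,
                         "meters": ["instance", "image"]}]}
daily = with_interval(original, "meter_source", 86400)
print(daily["sources"][0]["interval"])     # 86400
print(original["sources"][0]["interval"])  # 30 (the input is left unchanged)
```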

Note

The pipeline-agent-notification.yaml file needs to be changed on all controller nodes to change the white-listing and polling strategy.

13.3.5.2 Edit the List of Meters

The number of enabled meters can be reduced or increased by editing the pipeline configuration of the notification and polling agents. To deploy these changes you must then restart the agent. If pollsters and notifications are both modified, then you will have to restart both the Polling Agent and the Notification Agent. ceilometer Collector will also need to be restarted. The following code is an example of a compute-only ceilometer Notification Agent (notification agent) pipeline-agent-notification.yaml file:

---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Important

If you enable meters at the container level in this file, every time the polling interval triggers a collection, at least 5 messages per existing container in swift are collected.

The following table illustrates the amount of data produced hourly in different scenarios:

swift Containers  swift Objects per container  Samples per Hour  Samples stored per 24 hours
10                10                           500               12000
10                100                          5000              120000
100               100                          50000             1200000
100               1000                         500000            12000000

The data in the table shows that even a very small swift storage with 10 containers and 100 objects per container will store 120,000 samples in 24 hours, which adds up to 3.6 million samples in 30 days.
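The figures in the table are consistent with a simple product: containers times objects per container times 5 messages, collected once per hour. The following Python sketch reproduces them under that assumption; the hourly interval and the factor of 5 are inferred from the table and the note above, and the function names are illustrative.

```python
MESSAGES_PER_POLL = 5  # from the "at least 5 messages" note above


def samples_per_hour(containers, objects_per_container):
    """Samples produced by one hourly poll, as implied by the table."""
    return containers * objects_per_container * MESSAGES_PER_POLL


def samples_per_day(containers, objects_per_container):
    """Samples stored in 24 hours with hourly polling."""
    return 24 * samples_per_hour(containers, objects_per_container)


for containers, objects in [(10, 10), (10, 100), (100, 100), (100, 1000)]:
    print(containers, objects,
          samples_per_hour(containers, objects),
          samples_per_day(containers, objects))
```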

Important

The size of each file does not have any impact on the number of samples collected. As shown in the table above, the smallest number of samples results from polling when there are a small number of files and a small number of containers. When there are a lot of small files and containers, the number of samples is the highest.

13.3.5.3 Add Resource Fields to Meters

By default, not all the resource metadata fields for an event are recorded and stored in ceilometer. If you want to collect metadata fields for a consumer application, for example, it is easier to add a field to an existing meter rather than creating a new meter. If you create a new meter, you must also reconfigure ceilometer.

Important

Consider the following information before you add or edit a meter:

  • You can add a maximum of 12 new fields.

  • Adding or editing a meter causes all non-default meters to STOP receiving notifications. You will need to restart ceilometer.

  • New meters added to the pipeline-polling.yaml.j2 file must also be added to the pipeline-agent-notification.yaml.j2 file. This is due to the fact that polling meters are drained by the notification agent and not by the collector.

  • After SUSE OpenStack Cloud is installed, services like compute, cinder, glance, and neutron are configured to publish ceilometer meters by default. Other meters can also be enabled after the services are configured to start publishing the meter. The only requirement for publishing a meter is that the origin must have a value of notification. For a complete list of meters, see the OpenStack documentation on Measurements.

  • Not all meters are supported. Meters collected by ceilometer Compute Agent or any agent other than ceilometer Polling are not supported or tested with SUSE OpenStack Cloud.

  • Identity meters are disabled by keystone.

  • To enable ceilometer to start collecting meters, some services require you enable the meters you need in the service first before enabling them in ceilometer. Refer to the documentation for the specific service before you add new meters or resource fields.

To add Resource Metadata fields:

  1. Log on to the Cloud Lifecycle Manager (deployer node).

  2. To change to the ceilometer directory, run:

    ardana > cd ~/openstack/my_cloud/config/ceilometer
  3. In a text editor, open the target configuration file (for example, monasca-field-definitions.yaml.j2).

  4. In the metadata section, either add a new meter or edit an existing one provided by SUSE OpenStack Cloud.

  5. Include the metadata fields you need. You can use the instance meter in the file as an example.

  6. Save and close the configuration file.

  7. To save your changes in SUSE OpenStack Cloud, run:

    ardana > cd ~/openstack
    ardana > git add -A
    ardana > git commit -m "My config"
  8. If you added a new meter, reconfigure ceilometer:

    ardana > cd ~/openstack/ardana/ansible/
    # To run the config-processor playbook:
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    #To run the ready-deployment playbook:
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml

13.3.5.4 Update the Polling Strategy and Swift Considerations

Polling can be very taxing on the system due to the sheer volume of data that the system may have to process. It also has a severe impact on queries, because the database must scan a very large amount of data to respond to each query. This consumes a great amount of CPU and memory. It can result in long wait times for query responses and, in extreme cases, in timeouts.

There are 3 polling meters in swift:

  • storage.objects

  • storage.objects.size

  • storage.objects.containers

Here is an example of pipeline.yml in which swift polling is set to occur hourly.

---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://

With the configuration above, polling of container-based meters is not enabled, and only 3 messages are collected for any given tenant, one for each meter listed in the configuration file. Because there are only 3 messages per tenant, this does not create a heavy load on the MySQL database as it would if container-based meters were enabled. Hence, other APIs are not affected by this data collection configuration.
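The tenant-level load is easy to estimate: one message per meter per tenant for each poll. The sketch below assumes the hourly interval from the example; the tenant count of 50 is an arbitrary illustration and `tenant_samples_per_day` is a hypothetical helper.

```python
SWIFT_TENANT_METERS = (
    "storage.objects",
    "storage.objects.size",
    "storage.objects.containers",
)


def tenant_samples_per_day(tenants, interval_seconds=3600):
    """Messages collected per day when only tenant-level meters are polled."""
    polls_per_day = 86400 // interval_seconds
    return tenants * len(SWIFT_TENANT_METERS) * polls_per_day


print(tenant_samples_per_day(50))  # 50 tenants x 3 meters x 24 polls = 3600
```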

13.3.6 Ceilometer Metering Setting Role-based Access Control

Role-Based Access Control (RBAC) is a technique that limits access to resources based on a specific set of roles associated with each user's credentials.

keystone has a set of users that are associated with each project. Each user has at least one role. After a user has authenticated with keystone using a valid set of credentials, keystone will augment that request with the Roles that are associated with that user. These roles are added to the Request Header under the X-Roles attribute and are presented as a comma-separated list.
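A service that receives such a request can split the header value to decide what the caller may do. The following Python is a minimal sketch of that parsing step, assuming the headers are available as a plain dictionary; `parse_roles` and `has_role` are hypothetical helpers, and the comparison is case-sensitive for simplicity.

```python
def parse_roles(x_roles_header):
    """Split a comma-separated X-Roles value into a set of role names."""
    return {role.strip() for role in x_roles_header.split(",") if role.strip()}


def has_role(headers, role):
    """Check whether the request headers carry the given role."""
    return role in parse_roles(headers.get("X-Roles", ""))


headers = {"X-Roles": "member, ResellerAdmin"}
print(has_role(headers, "ResellerAdmin"))  # True
print(has_role(headers, "admin"))          # False
```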

13.3.6.1 Displaying All Users

To discover the list of users available in the system, an administrator can run the following command using the OpenStack command-line interface:

ardana > source ~/service.osrc
ardana > openstack user list

The output should resemble this response, which is a list of all the users currently available in this system.

+----------------------------------+--------------------+---------+--------------------+
|                id                |        name        | enabled |       email        |
+----------------------------------+--------------------+---------+--------------------+
| 1c20d327c92a4ea8bb513894ce26f1f1 | admin              | True    | admin.example.com  |
| 0f48f3cc093c44b4ad969898713a0d65 | ceilometer         | True    | nobody@example.com |
| 85ba98d27b1c4c8f97993e34fcd14f48 | cinder             | True    | nobody@example.com |
| d2ff982a0b6547d0921b94957db714d6 | demo               | True    | demo@example.com   |
| b2d597e83664489ebd1d3c4742a04b7c | ec2                | True    | nobody@example.com |
| 2bd85070ceec4b608d9f1b06c6be22cb | glance             | True    | nobody@example.com |
| 0e9e2daebbd3464097557b87af4afa4c | heat               | True    | nobody@example.com |
| 0b466ddc2c0f478aa139d2a0be314467 | neutron            | True    | nobody@example.com |
| 5cda1a541dee4555aab88f36e5759268 | nova               | True    | nobody@example.com |
| 1cefd1361be8437d9684eb2add8bdbfa | swift              | True    | nobody@example.com |
| f05bac3532c44414a26c0086797dab23 | user20141203213957 | True    | nobody@example.com |
| 3db0588e140d4f88b0d4cc8b5ca86a0b | user20141205232231 | True    | nobody@example.com |
+----------------------------------+--------------------+---------+--------------------+

13.3.6.2 Displaying All Roles

To see all the roles that are currently available in the deployment, an administrator (someone with the admin role) can run the following command:

ardana > source ~/service.osrc
ardana > openstack role list

The output should resemble the following response:

+----------------------------------+-------------------------------------+
|                id                |                 name                |
+----------------------------------+-------------------------------------+
| 507bface531e4ac2b7019a1684df3370 |            ResellerAdmin            |
| 9fe2ff9ee4384b1894a90878d3e92bab |               member                |
| e00e9406b536470dbde2689ce1edb683 |                admin                |
| aa60501f1e664ddab72b0a9f27f96d2c |           heat_stack_user           |
| a082d27b033b4fdea37ebb2a5dc1a07b |               service               |
| 8f11f6761534407585feecb5e896922f |            swiftoperator            |
+----------------------------------+-------------------------------------+

13.3.6.3 Assigning a Role to a User

In this example, we want to add the role ResellerAdmin to the demo user who has the ID d2ff982a0b6547d0921b94957db714d6.

  1. Determine which Project/Tenant the user belongs to.

    ardana > source ~/service.osrc
    ardana > openstack user show d2ff982a0b6547d0921b94957db714d6

    The response should resemble the following output:

    +---------------------+----------------------------------+
    | Field               | Value                            |
    +---------------------+----------------------------------+
    | domain_id           | default                          |
    | enabled             | True                             |
    | id                  | d2ff982a0b6547d0921b94957db714d6 |
    | name                | demo                             |
    | options             | {}                               |
    | password_expires_at | None                             |
    +---------------------+----------------------------------+
  2. We need to link the ResellerAdmin Role to a Project/Tenant. To start, determine which tenants are available on this deployment.

    ardana > source ~/service.osrc
    ardana > openstack project list

    The response should resemble the following output:

    +----------------------------------+-------------------+---------+
    |                id                |        name       | enabled |
    +----------------------------------+-------------------+---------+
    | 4a8f4207a13444089a18dc524f41b2cf |       admin       |   True  |
    | 00cbaf647bf24627b01b1a314e796138 |        demo       |   True  |
    | 8374761f28df43b09b20fcd3148c4a08 |        gf1        |   True  |
    | 0f8a9eef727f4011a7c709e3fbe435fa |        gf2        |   True  |
    | 6eff7b888f8e470a89a113acfcca87db |        gf3        |   True  |
    | f0b5d86c7769478da82cdeb180aba1b0 |        jaq1       |   True  |
    | a46f1127e78744e88d6bba20d2fc6e23 |        jaq2       |   True  |
    | 977b9b7f9a6b4f59aaa70e5a1f4ebf0b |        jaq3       |   True  |
    | 4055962ba9e44561ab495e8d4fafa41d |        jaq4       |   True  |
    | 33ec7f15476545d1980cf90b05e1b5a8 |        jaq5       |   True  |
    | 9550570f8bf147b3b9451a635a1024a1 |      service      |   True  |
    +----------------------------------+-------------------+---------+
  3. Now that we have all the pieces, we can assign the ResellerAdmin role to this User on the Demo project.

    ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138 507bface531e4ac2b7019a1684df3370

    This will produce no response if everything is correct.

  4. Validate that the role has been assigned correctly. Pass in the user and tenant ID and request a list of roles assigned.

    ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138

    Note that all members have the member role as a default role in addition to any other roles that have been assigned.

    +----------------------------------+---------------+----------------------------------+----------------------------------+
    |                id                |      name     |             user_id              |            tenant_id             |
    +----------------------------------+---------------+----------------------------------+----------------------------------+
    | 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    | 9fe2ff9ee4384b1894a90878d3e92bab |    member     | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    +----------------------------------+---------------+----------------------------------+----------------------------------+
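When the same assignment has to be repeated for several users, the two commands from steps 3 and 4 can be assembled programmatically. The sketch below only builds the argument lists, using the `--user` and `--project` options shown above; it does not run anything, and `role_add_command` and `role_list_command` are hypothetical helpers.

```python
import shlex


def role_add_command(user_id, project_id, role_id):
    """Argument list for assigning a role to a user on a project."""
    return ["openstack", "role", "add",
            "--user", user_id, "--project", project_id, role_id]


def role_list_command(user_id, project_id):
    """Argument list for listing a user's roles on a project."""
    return ["openstack", "role", "list",
            "--user", user_id, "--project", project_id]


# IDs of the demo user, the demo project, and the ResellerAdmin role
# from the sample output above.
add_cmd = role_add_command(
    "d2ff982a0b6547d0921b94957db714d6",
    "00cbaf647bf24627b01b1a314e796138",
    "507bface531e4ac2b7019a1684df3370",
)
print(shlex.join(add_cmd))
```

The lists can be passed to `subprocess.run` in a shell where `~/service.osrc` has already been sourced.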

13.3.6.4 Creating a New Role

In this example, we will create a Level 3 Support role called L3Support.

  1. Add the new role to the list of roles.

    ardana > openstack role create L3Support

    The response should resemble the following output:

    +----------+----------------------------------+
    | Property |              Value               |
    +----------+----------------------------------+
    |    id    | 7e77946db05645c4ba56c6c82bf3f8d2 |
    |   name   |            L3Support             |
    +----------+----------------------------------+
  2. Now that we have the new role's ID, we can add that role to the Demo user from the previous example.

    ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6  --project 00cbaf647bf24627b01b1a314e796138 7e77946db05645c4ba56c6c82bf3f8d2

    This will produce no response if everything is correct.

  3. Verify that the user Demo has both the ResellerAdmin and L3Support roles.

    ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138
  4. The response should resemble the following output. Note that this user has the L3Support role, the ResellerAdmin role, and the default member role.

    +----------------------------------+---------------+----------------------------------+----------------------------------+
    |                id                |      name     |             user_id              |            tenant_id             |
    +----------------------------------+---------------+----------------------------------+----------------------------------+
    | 7e77946db05645c4ba56c6c82bf3f8d2 |   L3Support   | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    | 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    | 9fe2ff9ee4384b1894a90878d3e92bab |    member     | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    +----------------------------------+---------------+----------------------------------+----------------------------------+

13.3.6.5 Access Policies

Before RBAC was introduced, ceilometer had very simple access control. There were two types of user: admins and users. Admins could access any API and perform any operation. Users could access only non-admin APIs and could perform operations only on the Project/Tenant to which they belonged.
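The two-tier rule described above fits in a single function. This Python sketch is an illustration of that rule only, not ceilometer's policy code; the function and parameter names are hypothetical.

```python
def is_allowed(is_admin, api_is_admin_only, user_project, target_project):
    """Two-tier rule: admins may do anything; other users are limited
    to non-admin APIs on their own project."""
    if is_admin:
        return True
    return (not api_is_admin_only) and user_project == target_project


print(is_allowed(True, True, "admin", "demo"))   # True: admins are unrestricted
print(is_allowed(False, False, "demo", "demo"))  # True: own project, non-admin API
print(is_allowed(False, False, "demo", "gf1"))   # False: another project
print(is_allowed(False, True, "demo", "demo"))   # False: admin-only API
```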

13.3.7 Ceilometer Metering Failover HA Support

In the SUSE OpenStack Cloud environment, the ceilometer metering service supports native Active-Active high-availability (HA) for the notification and polling agents. Implementing HA support includes workload-balancing, workload-distribution and failover.

Tooz is the coordination engine that is used to coordinate workload among multiple active agent instances. It also keeps track of the active instances, using heartbeats (pings), to handle failover and group membership.

Zookeeper is used as the coordination backend. Tooz uses Zookeeper to expose the APIs that manage group membership and retrieve the workload specific to each agent.
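The idea behind group membership and per-agent workload can be illustrated without a coordination backend. The Python sketch below is a deliberate simplification: it assigns each resource to one member with a stable hash and a modulo, whereas the real coordination layer tracks membership through heartbeats and may partition differently. The agent names, resource names, and the helpers `owner` and `my_share` are all hypothetical.

```python
import hashlib


def owner(resource_id, members):
    """Pick the member responsible for a resource using a stable hash."""
    ordered = sorted(members)
    digest = hashlib.sha256(resource_id.encode("utf-8")).hexdigest()
    return ordered[int(digest, 16) % len(ordered)]


def my_share(resources, members, me):
    """Resources that the agent named `me` should handle."""
    return [r for r in resources if owner(r, members) == me]


members = ["agent-1", "agent-2", "agent-3"]
resources = [f"tenant-{i}" for i in range(20)]

for member in members:
    print(member, len(my_share(resources, members, member)))

# Failover: when agent-3 stops sending heartbeats it leaves the group,
# and the remaining members recompute their shares over the full list.
survivors = ["agent-1", "agent-2"]
for member in survivors:
    print(member, len(my_share(resources, survivors, member)))
```

Every resource is handled by exactly one active member both before and after the membership change.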

The following section in the configuration file is used to implement high-availability (HA):

[coordination]
# The port is usually 2181, the Zookeeper default.
backend_url = <IP address of Zookeeper host: port>
heartbeat = 1.0
check_watchers = 10.0

For the notification agent to be configured in HA mode, additional configuration is needed in the configuration file:

[notification]
workload_partitioning = true

The HA notification agent distributes workload among multiple queues that are created based on the number of unique source:sink combinations. The combinations are configured in the notification agent pipeline configuration file. If there are additional services to be metered using notifications, then the recommendation is to use a separate source for those events. This is recommended especially if the expected load of data from that source is considered high. Implementing HA support should lead to better workload balancing among multiple active notification agents.

ceilometer-expirer also runs in Active-Active HA mode. When there are multiple contenders, Tooz is used to pick the expirer process that acquires the lock, and the winning process runs. There is no failover support, because the expirer is not a daemon and is scheduled to run at predetermined intervals.

Important

You must ensure that only a single expirer process runs when multiple processes are scheduled to run at the same time on multiple controller nodes. This must be done using cron-based scheduling.

The following configuration is needed to enable expirer HA:

[coordination]
# The port is usually 2181, the Zookeeper default.
backend_url = <IP address of Zookeeper host: port>
heartbeat = 1.0
check_watchers = 10.0

The notification agent HA support is mainly designed to coordinate among notification agents so that correlated samples can be handled by the same agent. This happens when samples get transformed from other samples. The SUSE OpenStack Cloud ceilometer pipeline has no transformers, so this task of coordination and workload partitioning does not need to be enabled. The notification agent is deployed on multiple controller nodes and they distribute workload among themselves by randomly fetching the data from the queue.

To disable coordination and workload partitioning by OpenStack, set the following value in the configuration file:

        [notification]
        workload_partitioning = False
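To confirm what a node is actually configured with, the relevant options can be read back with a short script. The sketch below uses Python's standard configparser as a stand-in for the service's own configuration loader and parses an inline sample rather than a real file; the sample text mirrors the sections shown above, without the backend_url placeholder.

```python
import configparser

# Inline sample mirroring the sections shown above.
SAMPLE = """
[coordination]
heartbeat = 1.0
check_watchers = 10.0

[notification]
workload_partitioning = False
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# getboolean accepts true/false in any letter case.
partitioning = parser.getboolean("notification", "workload_partitioning")
heartbeat = parser.getfloat("coordination", "heartbeat")
print(partitioning, heartbeat)  # False 1.0
```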
Important

When a configuration change is made to an API running under the HA Proxy, that change needs to be replicated in all controllers.

13.3.8 Optimizing the Ceilometer Metering Service

You can improve ceilometer responsiveness by configuring metering to store only the data you require. This topic provides strategies for getting the most out of metering while not overloading your resources.

13.3.8.1 Change the List of Meters

The list of meters can be easily reduced or increased by editing the pipeline.yaml file and restarting the polling agent.

Sample compute-only pipeline.yaml file with a daily poll interval:

---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Note

This change will cause all non-default meters to stop receiving notifications.
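
When editing the meter list by hand, it is easy to leave a source pointing at a sink that no longer exists. The following sketch checks for that. It assumes the pipeline has already been loaded into the Python structure that a YAML loader would produce, so no YAML library is needed here; the helper name undefined_sinks is hypothetical and not part of ceilometer.

```python
# The sample pipeline above, written as plain Python data.
PIPELINE = {
    "sources": [
        {
            "name": "meter_source",
            "interval": 86400,
            "meters": [
                "instance",
                "memory",
                "vcpus",
                "compute.instance.create.end",
                "compute.instance.delete.end",
                "compute.instance.update",
                "compute.instance.exists",
            ],
            "sinks": ["meter_sink"],
        }
    ],
    "sinks": [
        {"name": "meter_sink", "transformers": None, "publishers": ["notifier://"]}
    ],
}


def undefined_sinks(pipeline):
    """Return (source name, sink name) pairs that reference a missing sink."""
    defined = {sink["name"] for sink in pipeline.get("sinks", [])}
    return [
        (source["name"], sink)
        for source in pipeline.get("sources", [])
        for sink in source.get("sinks", [])
        if sink not in defined
    ]


print(undefined_sinks(PIPELINE))  # an empty list means every reference resolves
print(PIPELINE["sources"][0]["interval"] // 3600, "hours between polls")
```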

13.3.8.2 Enable Nova Notifications

You can configure nova to send notifications by enabling the setting in the configuration file. When enabled, nova will send information to ceilometer related to its usage and VM status. You must restart nova for these changes to take effect.

The OpenStack notification daemon, also known as the notification agent, monitors the message bus for data being provided by other OpenStack components such as nova. The notification daemon loads one or more listener plugins, using the ceilometer.notification namespace. Each plugin can listen to any topic, but by default it will listen to the notifications.info topic. The listeners grab messages off the defined topics and redistribute them to the appropriate plugins (endpoints) to be processed into Events and Samples. After the nova service is restarted, you should verify that the notification daemons are receiving traffic.

For a more in-depth look at how information is sent over openstack.common.rpc, refer to the OpenStack ceilometer documentation.

nova can be configured to send the following data to ceilometer:

Name                 Type   Unit      Resource     Note
instance             Gauge  instance  instance ID  Existence of instance
instance:<type>      Gauge  instance  instance ID  Existence of instance of <type> (where <type> is a valid OpenStack type)
memory               Gauge  MB        instance ID  Amount of allocated RAM. Measured in MB.
vcpus                Gauge  vcpu      instance ID  Number of VCPUs
disk.root.size       Gauge  GB        instance ID  Size of root disk. Measured in GB.
disk.ephemeral.size  Gauge  GB        instance ID  Size of ephemeral disk. Measured in GB.

To enable nova to publish notifications:

  1. In a text editor, open the following file:

    nova.conf
  2. Compare the example of a working configuration file with the necessary changes to your configuration file. If there is anything missing in your file, add it, and then save the file.

    notification_driver=messaging
    notification_topics=notifications
    notify_on_state_change=vm_and_task_state
    instance_usage_audit=True
    instance_usage_audit_period=hour
    Important

    The instance_usage_audit_period interval can be set to check the instance's status every hour, once a day, once a week or once a month. Every time the audit period elapses, nova sends a notification to ceilometer to record whether or not the instance is alive and running. Metering this statistic is critical if billing depends on usage.

  3. To restart the nova services, run:

    tux > sudo systemctl restart nova-api.service
    tux > sudo systemctl restart nova-conductor.service
    tux > sudo systemctl restart nova-scheduler.service
    tux > sudo systemctl restart nova-novncproxy.service
    Important

    Different platforms may use their own commands to restart nova-compute services. If the above commands do not work, refer to the documentation for your specific platform.

  4. To verify successful launch of each process, list the service components:

    ardana > source ~/service.osrc
    ardana > openstack compute service list
    +----+------------------+------------+----------+---------+-------+----------------------------+-----------------+
    | Id | Binary           | Host       | Zone     | Status  | State | Updated_at                 | Disabled Reason |
    +----+------------------+------------+----------+---------+-------+----------------------------+-----------------+
    | 1  | nova-conductor   | controller | internal | enabled | up    | 2014-09-16T23:54:02.000000 | -               |
    | 3  | nova-scheduler   | controller | internal | enabled | up    | 2014-09-16T23:54:07.000000 | -               |
    | 4  | nova-cert        | controller | internal | enabled | up    | 2014-09-16T23:54:00.000000 | -               |
    | 5  | nova-compute     | compute1   | nova     | enabled | up    | 2014-09-16T23:54:06.000000 | -               |
    +----+------------------+------------+----------+---------+-------+----------------------------+-----------------+
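
The comparison described in step 2 can be automated. The sketch below reads configuration text with Python's standard configparser module and reports every option from the working example that is missing or set to a different value. It assumes that all five options live in a single section, here hypothetically [DEFAULT]; the source does not state the section, so adjust it to match your file. A differing value is not necessarily an error (for example, instance_usage_audit_period may legitimately be set to another period).

```python
import configparser

# Settings from the working example in step 2.
EXAMPLE = {
    "notification_driver": "messaging",
    "notification_topics": "notifications",
    "notify_on_state_change": "vm_and_task_state",
    "instance_usage_audit": "True",
    "instance_usage_audit_period": "hour",
}


def diff_from_example(conf_text, section="DEFAULT"):
    """Map each option that is absent or differs to its current value."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    differences = {}
    for option, expected in EXAMPLE.items():
        actual = parser.get(section, option, fallback=None)  # None if absent
        if actual != expected:
            differences[option] = actual
    return differences


# Hypothetical file content: one option missing, one set differently.
SAMPLE_NOVA_CONF = """
[DEFAULT]
notification_driver = messaging
notification_topics = notifications
instance_usage_audit = True
instance_usage_audit_period = day
"""

print(diff_from_example(SAMPLE_NOVA_CONF))
```

For the sample text, the result lists notify_on_state_change as absent (None) and instance_usage_audit_period as set to day.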

13.3.8.3 Improve Reporting API Responsiveness

Reporting APIs are the main means of access to the metering data stored in ceilometer. These APIs are accessed by horizon to provide basic usage data and information.

SUSE OpenStack Cloud uses Apache2 Web Server to provide the API access. This topic provides some strategies to help you optimize the front-end and back-end databases.

To improve responsiveness, you can increase the number of threads and processes in the ceilometer configuration file. Each process can have a certain number of threads managing the filters and applications that make up the processing pipeline.

To configure Apache2 to use an increased number of threads, follow the steps in Section 13.3.4, “Configure the Ceilometer Metering Service”.

Warning

The resource usage panel could take some time to load depending on the number of metrics selected.

13.3.8.4 Update the Polling Strategy and Swift Considerations

Polling can put an excessive amount of strain on the system due to the amount of data the system may have to process. Polling also has a severe impact on queries, since the database can have a very large amount of data to scan before responding to a query. This process usually consumes a large amount of CPU and memory to complete the requests. Clients can also experience long waits for queries to come back and, in extreme cases, even time out.

There are 3 polling meters in swift:

  • storage.objects

  • storage.objects.size

  • storage.objects.containers

Sample section of the pipeline.yaml configuration file with swift polling on an hourly interval:

---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://

Every time the polling interval occurs, at least 3 messages per existing object/container in swift are collected. The following table illustrates the amount of data produced hourly in different scenarios:

swift Containers  swift Objects per container  Samples per Hour  Samples stored per 24 hours
10                10                           500               12000
10                100                          5000              120000
100               100                          50000             1200000
100               1000                         500000            12000000

Looking at the data, we can see that even a very small swift deployment with 10 containers and 100 objects per container stores 120,000 samples in 24 hours, which adds up to 3.6 million samples over 30 days.
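
The figures in the table can be reproduced with a few lines of arithmetic. Two values in this sketch are assumptions read off the numbers rather than stated in the text: every row works out to five samples per object per hourly poll, and the 3.6 million total corresponds to 30 days of retention (120,000 × 30).

```python
SAMPLES_PER_OBJECT = 5  # assumption: inferred from the table rows


def samples_per_hour(containers, objects_per_container):
    """Samples produced by one hourly polling cycle."""
    return containers * objects_per_container * SAMPLES_PER_OBJECT


def samples_stored(containers, objects_per_container, days=1):
    """Samples accumulated over the given number of days of hourly polling."""
    return samples_per_hour(containers, objects_per_container) * 24 * days


for containers, objects in [(10, 10), (10, 100), (100, 100), (100, 1000)]:
    print(
        containers,
        objects,
        samples_per_hour(containers, objects),
        samples_stored(containers, objects),
    )

print(samples_stored(10, 100, days=30))  # 3600000
```

Doubling the polling interval halves every figure, which is why lengthening the interval is usually the first adjustment to consider.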

Note

The size of each file has no impact on the number of samples collected. In fact, the smaller the number of containers or files, the smaller the number of samples. In a scenario where there is a large number of small files and containers, the number of samples is also large and performance is at its worst.

13.3.9 Metering Service Samples

Samples are discrete collections of a particular meter or the actual usage data defined by a meter description. Each sample is time-stamped and includes a variety of data that varies per meter but usually includes the project ID and UserID of the entity that consumed the resource represented by the meter and sample.

In a typical deployment, the number of samples can be in the tens of thousands if not higher for a specific collection period depending on overall activity.

Sample collection and data storage expiry settings are configured in ceilometer. For use cases such as monthly billing cycles, data is usually retained for 45 days, which requires a large, scalable back-end database to support the volume of samples generated by production OpenStack deployments.

Example configuration:

[database]
metering_time_to_live=-1
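
In upstream ceilometer, metering_time_to_live is expressed in seconds, and a value of -1 (as in the example above) means that samples are kept forever. Assuming the same semantics apply to this release, a 45-day retention period can be converted as follows; the helper name ttl_seconds is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(days=None):
    """Convert a retention period in days to seconds; None means keep forever."""
    if days is None:
        return -1
    return days * SECONDS_PER_DAY


print(f"metering_time_to_live={ttl_seconds(45)}")  # metering_time_to_live=3888000
print(f"metering_time_to_live={ttl_seconds()}")    # metering_time_to_live=-1
```
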

In our example use case, to construct a complete billing record, an external billing application must collect all pertinent samples. The results must then be sorted, summarized, and combined with the results of other types of metered samples that are required. This function is known as aggregation and is external to the ceilometer service.

Meter data, or samples, can also be collected directly from the service APIs by individual ceilometer polling agents. These polling agents directly access service usage by calling the API of each service.

OpenStack services such as swift currently only provide metered data through this function and some of the other OpenStack services provide specific metrics only through a polling action.

14 Managing Container as a Service (Magnum)

The SUSE OpenStack Cloud Magnum Service makes container orchestration engines such as Docker Swarm, Kubernetes, and Apache Mesos available as first-class resources. SUSE OpenStack Cloud Magnum uses heat to orchestrate an OS image which contains Docker and Kubernetes and runs that image in either virtual machines or bare metal in a cluster configuration.

14.1 Deploying a Kubernetes Cluster on Fedora Atomic

14.1.1 Prerequisites

These steps assume the following have been completed:

  • The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.

  • Deploying a Kubernetes Cluster on Fedora Atomic requires the Fedora Atomic image fedora-atomic-26-20170723.0.x86_64.qcow2 prepared specifically for the OpenStack release. You can download the fedora-atomic-26-20170723.0.x86_64.qcow2 image from https://fedorapeople.org/groups/magnum/

14.1.2 Creating the Cluster

The following example is created using Kubernetes Container Orchestration Engine (COE) running on Fedora Atomic guest OS on SUSE OpenStack Cloud VMs.

  1. As the stack user, log in to the Cloud Lifecycle Manager.

  2. Source the OpenStack admin credentials.

    $ source service.osrc
  3. If you have not already done so, download the Fedora Atomic image prepared for the OpenStack Pike release.

    $ wget https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-26-20170723.0/CloudImages/x86_64/images/Fedora-Atomic-26-20170723.0.x86_64.qcow2
  4. Create a glance image.

    $ openstack image create --name fedora-atomic-26-20170723.0.x86_64 --visibility public \
      --disk-format qcow2 --os-distro fedora-atomic --container-format bare \
      --file Fedora-Atomic-26-20170723.0.x86_64.qcow2 --progress
    [=============================>] 100%
    +------------------+--------------------------------------+
    | Property         | Value                                |
    +------------------+--------------------------------------+
    | checksum         | 9d233b8e7fbb7ea93f20cc839beb09ab     |
    | container_format | bare                                 |
    | created_at       | 2017-04-10T21:13:48Z                 |
    | disk_format      | qcow2                                |
    | id               | 4277115a-f254-46c0-9fb0-fffc45d2fd38 |
    | min_disk         | 0                                    |
    | min_ram          | 0                                    |
    | name             | fedora-atomic-26-20170723.0.x86_64   |
    | os_distro        | fedora-atomic                        |
    | owner            | 2f5b83ab49d54aaea4b39f5082301d09     |
    | protected        | False                                |
    | size             | 515112960                            |
    | status           | active                               |
    | tags             | []                                   |
    | updated_at       | 2017-04-10T21:13:56Z                 |
    | virtual_size     | None                                 |
    | visibility       | public                               |
    +------------------+--------------------------------------+
  5. Create a nova keypair.

    $ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    $ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
  6. Create a Magnum cluster template.

    $ magnum cluster-template-create --name my-template \
      --image-id 4277115a-f254-46c0-9fb0-fffc45d2fd38 \
      --keypair-id testkey \
      --external-network-id ext-net \
      --dns-nameserver 8.8.8.8 \
      --flavor-id m1.small \
      --docker-volume-size 5 \
      --network-driver flannel \
      --coe kubernetes \
      --http-proxy http://proxy.yourcompany.net:8080/ \
      --https-proxy http://proxy.yourcompany.net:8080/
    Note
    1. Use the image_id from the output of the openstack image create command in the previous step.

    2. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should provide resolution for that hostname.

    3. The proxy is only needed if the public internet (for example, https://discovery.etcd.io/ or https://gcr.io/) is not accessible without a proxy.

  7. Create the cluster. The command below creates a minimal cluster consisting of a single Kubernetes Master (kubemaster) and a single Kubernetes Node (worker, kubeminion).

    $ magnum cluster-create --name my-cluster --cluster-template my-template --node-count 1 --master-count 1
  8. Immediately after the cluster-create command is issued, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.

    $ magnum cluster-show my-cluster
    +---------------------+------------------------------------------------------------+
    | Property            | Value                                                      |
    +---------------------+------------------------------------------------------------+
    | status              | CREATE_IN_PROGRESS                                         |
    | cluster_template_id | 245c6bf8-c609-4ea5-855a-4e672996cbbc                       |
    | uuid                | 0b78a205-8543-4589-8344-48b8cfc24709                       |
    | stack_id            | 22385a42-9e15-49d9-a382-f28acef36810                       |
    | status_reason       | -                                                          |
    | created_at          | 2017-04-10T21:25:11+00:00                                  |
    | name                | my-cluster                                                 |
    | updated_at          | -                                                          |
    | discovery_url       | https://discovery.etcd.io/193d122f869c497c2638021eae1ab0f7 |
    | api_address         | -                                                          |
    | coe_version         | -                                                          |
    | master_addresses    | []                                                         |
    | create_timeout      | 60                                                         |
    | node_addresses      | []                                                         |
    | master_count        | 1                                                          |
    | container_version   | -                                                          |
    | node_count          | 1                                                          |
    +---------------------+------------------------------------------------------------+
  9. You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:

    $ heat resource-list -n2 22385a42-9e15-49d9-a382-f28acef36810
    WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
    +-------------------------------+--------------------------------------+-----------------------------------+--------------------+----------------------+-------------------------+
    | resource_name                 | physical_resource_id                 | resource_type                     | resource_status    | updated_time         | stack_name              |
    +-------------------------------+--------------------------------------+-----------------------------------+--------------------+----------------------+-------------------------+
    | api_address_floating_switch   | 06b2cc0d-77f9-4633-8d96-f51e2db1faf3 | Magnum::FloatingIPAddressSwitcher | CREATE_COMPLETE    | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv |
    | api_address_lb_switch         | 965124ca-5f62-4545-bbae-8d9cda7aff2e | Magnum::ApiGatewaySwitcher        | CREATE_COMPLETE    | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv |
    . . .
  10. The cluster is complete when all resources show CREATE_COMPLETE.

  11. Install kubectl on your Cloud Lifecycle Manager.

    $ export https_proxy=http://proxy.yourcompany.net:8080
    $ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.0/bin/linux/amd64/kubectl
    $ chmod +x ./kubectl
    $ sudo mv ./kubectl /usr/local/bin/kubectl
  12. Generate the cluster configuration using magnum cluster-config. If the CLI option --tls-disabled was not specified during cluster template creation, authentication in the cluster is turned on. In this case, the magnum cluster-config command generates a client authentication certificate (cert.pem) and key (key.pem). Copy and paste the magnum cluster-config output into your command line to finalize the configuration (that is, to export the KUBECONFIG environment variable).

    $ mkdir my_cluster
    $ cd my_cluster
    /my_cluster $ ls
    /my_cluster $ magnum cluster-config my-cluster
    export KUBECONFIG=./config
    /my_cluster $ ls
    ca.pem cert.pem config key.pem
    /my_cluster $ export KUBECONFIG=./config
    /my_cluster $ kubectl version
    Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"5cb86ee022267586db386f62781338b0483733b3", GitTreeState:"clean"}
    Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"cffae0523cfa80ddf917aba69f08508b91f603d5", GitTreeState:"clean"}
  13. Create a simple Nginx replication controller, exposed as a service of type NodePort.

    $ cat >nginx.yml <<-EOF
    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: nginx-controller
    spec:
      replicas: 1
      selector:
        app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - name: nginx
              image: nginx
              ports:
                - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-service
    spec:
      type: NodePort
      ports:
      - port: 80
        nodePort: 30080
      selector:
        app: nginx
    EOF
    
    $ kubectl create -f nginx.yml
  14. Check pod status until it turns from Pending to Running.

    $ kubectl get pods
    NAME                      READY    STATUS     RESTARTS    AGE
    nginx-controller-5cmev    1/1      Running    0           2m
  15. Ensure that the Nginx welcome page is displayed at port 30080 using the kubemaster floating IP.

    $ http_proxy= curl http://172.31.0.6:30080
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
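
Steps 8 to 10 amount to polling the cluster status until creation finishes. A minimal sketch of such a loop follows. The fetch_status callable is hypothetical: it stands for whatever returns the current status string, for example by parsing the output of magnum cluster-show. CREATE_FAILED as the failure status is an assumption, and the 3600-second default mirrors the create_timeout of 60 shown in the output above, assuming that value is in minutes.

```python
import time


def wait_for_status(fetch_status, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch_status() until creation completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == "CREATE_COMPLETE":
            return status
        if status == "CREATE_FAILED":  # assumed failure status
            raise RuntimeError("cluster creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        sleep(interval)


# Demonstration with canned statuses instead of a real cluster.
canned = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
result = wait_for_status(lambda: next(canned), interval=0, sleep=lambda _: None)
print(result)  # CREATE_COMPLETE
```

The sleep parameter exists only so that the loop can be exercised without waiting; in normal use the default time.sleep applies.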

14.2 Deploying a Kubernetes Cluster on CoreOS

14.2.1 Prerequisites

These steps assume the following have been completed:

  • The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.

  • Creating the Magnum cluster requires the CoreOS image for OpenStack. You can download the compressed image file coreos_production_openstack_image.img.bz2 from http://stable.release.core-os.net/amd64-usr/current/.

14.2.2 Creating the Cluster

The following example is created using Kubernetes Container Orchestration Engine (COE) running on CoreOS guest OS on SUSE OpenStack Cloud VMs.

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the OpenStack admin credentials.

    $ source service.osrc
  3. If you have not already done so, download the CoreOS image that is compatible with the OpenStack release.

    Note

    The https_proxy is only needed if your environment requires a proxy.

    $ export https_proxy=http://proxy.yourcompany.net:8080
    $ wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_openstack_image.img.bz2
    $ bunzip2 coreos_production_openstack_image.img.bz2
  4. Create a glance image.

    $ openstack image create --name coreos-magnum --visibility public \
      --disk-format raw --os-distro coreos --container-format bare \
      --file coreos_production_openstack_image.img --progress
    [=============================>] 100%
    +------------------+--------------------------------------+
    | Property         | Value                                |
    +------------------+--------------------------------------+
    | checksum         | 4110469bb15af72ec0cf78c2da4268fa     |
    | container_format | bare                                 |
    | created_at       | 2017-04-25T18:10:52Z                 |
    | disk_format      | raw                                  |
    | id               | c25fc719-2171-437f-9542-fcb8a534fbd1 |
    | min_disk         | 0                                    |
    | min_ram          | 0                                    |
    | name             | coreos-magnum                        |
    | os_distro        | coreos                               |
    | owner            | 2f5b83ab49d54aaea4b39f5082301d09     |
    | protected        | False                                |
    | size             | 806551552                            |
    | status           | active                               |
    | tags             | []                                   |
    | updated_at       | 2017-04-25T18:11:07Z                 |
    | virtual_size     | None                                 |
    | visibility       | public                               |
    +------------------+--------------------------------------+
  5. Create a nova keypair.

    $ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    $ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
  6. Create a Magnum cluster template.

    $ magnum cluster-template-create --name my-coreos-template \
      --image-id c25fc719-2171-437f-9542-fcb8a534fbd1 \
      --keypair-id testkey \
      --external-network-id ext-net \
      --dns-nameserver 8.8.8.8 \
      --flavor-id m1.small \
      --docker-volume-size 5 \
      --network-driver flannel \
      --coe kubernetes \
      --http-proxy http://proxy.yourcompany.net:8080/ \
      --https-proxy http://proxy.yourcompany.net:8080/
    Note
    1. Use the image_id from the output of the openstack image create command in the previous step.

    2. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should provide resolution for that hostname.

    3. The proxy is only needed if the public internet (for example, https://discovery.etcd.io/ or https://gcr.io/) is not accessible without a proxy.

  7. Create the cluster. The command below creates a minimal cluster consisting of a single Kubernetes Master (kubemaster) and a single Kubernetes Node (worker, kubeminion).

    $ magnum cluster-create --name my-coreos-cluster --cluster-template my-coreos-template --node-count 1 --master-count 1
  8. Almost immediately after the cluster-create command is issued, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.

    $ magnum cluster-show my-coreos-cluster
    +---------------------+------------------------------------------------------------+
    | Property            | Value                                                      |
    +---------------------+------------------------------------------------------------+
    | status              | CREATE_IN_PROGRESS                                         |
    | cluster_template_id | c48fa7c0-8dd9-4da4-b599-9e62dc942ca5                       |
    | uuid                | 6b85e013-f7c3-4fd3-81ea-4ea34201fd45                       |
    | stack_id            | c93f873a-d563-4721-9bd9-3bae2340750a                       |
    | status_reason       | -                                                          |
    | created_at          | 2017-04-25T22:38:43+00:00                                  |
    | name                | my-coreos-cluster                                          |
    | updated_at          | -                                                          |
    | discovery_url       | https://discovery.etcd.io/6e4c0e5ff5e5b9872173d06880886a0c |
    | api_address         | -                                                          |
    | coe_version         | -                                                          |
    | master_addresses    | []                                                         |
    | create_timeout      | 60                                                         |
    | node_addresses      | []                                                         |
    | master_count        | 1                                                          |
    | container_version   | -                                                          |
    | node_count          | 1                                                          |
    +---------------------+------------------------------------------------------------+
  9. You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:

    $ heat resource-list -n2 c93f873a-d563-4721-9bd9-3bae2340750a
    WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
    +--------------------+----------------------+----------------------------+-----------------+----------------------+--------------------------------+
    | resource_name      | physical_resource_id | resource_type              | resource_status | updated_time         | stack_name                     |
    +--------------------+----------------------+----------------------------+-----------------+----------------------+--------------------------------+
    | api_address_switch |                      | Magnum::ApiGatewaySwitcher | INIT_COMPLETE   | 2017-04-25T22:38:42Z | my-coreos-cluster-mscybll54eoj |
    . . .
  10. The cluster is complete when all resources show CREATE_COMPLETE.

  11. Install kubectl on your Cloud Lifecycle Manager.

    $ export https_proxy=http://proxy.yourcompany.net:8080
    $ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.0/bin/linux/amd64/kubectl
    $ chmod +x ./kubectl
    $ sudo mv ./kubectl /usr/local/bin/kubectl
  12. Generate the cluster configuration using magnum cluster-config. If the CLI option --tls-disabled was not specified during cluster template creation, authentication in the cluster is turned on. In this case, the magnum cluster-config command generates a client authentication certificate (cert.pem) and key (key.pem). Copy and paste the magnum cluster-config output into your command line to finalize the configuration (that is, to export the KUBECONFIG environment variable).

    $ mkdir my_cluster
    $ cd my_cluster
    /my_cluster $ ls
    /my_cluster $ magnum cluster-config my-coreos-cluster
    export KUBECONFIG=./config
    /my_cluster $ ls
    ca.pem cert.pem config key.pem
    /my_cluster $ export KUBECONFIG=./config
    /my_cluster $ kubectl version
    Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"5cb86ee022267586db386f62781338b0483733b3", GitTreeState:"clean"}
    Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"cffae0523cfa80ddf917aba69f08508b91f603d5", GitTreeState:"clean"}
  13. Create a simple Nginx replication controller, exposed as a service of type NodePort.

    $ cat >nginx.yml <<-EOF
    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: nginx-controller
    spec:
      replicas: 1
      selector:
        app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - name: nginx
              image: nginx
              ports:
                - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-service
    spec:
      type: NodePort
      ports:
      - port: 80
        nodePort: 30080
      selector:
        app: nginx
    EOF
    
    $ kubectl create -f nginx.yml
  14. Check pod status until it turns from Pending to Running.

    $ kubectl get pods
    NAME                      READY    STATUS     RESTARTS    AGE
    nginx-controller-5cmev    1/1      Running    0           2m
  15. Ensure that the Nginx welcome page is displayed at port 30080 using the kubemaster floating IP.

    $ http_proxy= curl http://172.31.0.6:30080
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
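
The check in step 15 can also be scripted. The sketch below uses only Python's standard library. An empty ProxyHandler makes the request bypass any configured proxy, which is roughly what the empty http_proxy= prefix does for curl. The function names are hypothetical, and the fetch function is defined but not called here, because it needs a reachable cluster.

```python
import urllib.request


def fetch_without_proxy(url, timeout=10):
    """Fetch url while ignoring any proxy settings in the environment."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def is_nginx_welcome(html):
    """Return True if the page carries the default Nginx welcome title."""
    return "<title>Welcome to nginx!</title>" in html


# The first lines of the expected response, as shown in step 15.
SAMPLE_PAGE = "<!DOCTYPE html>\n<html>\n<head>\n<title>Welcome to nginx!</title>\n"
print(is_nginx_welcome(SAMPLE_PAGE))  # True
```

Against a running cluster, the two functions combine as is_nginx_welcome(fetch_without_proxy("http://172.31.0.6:30080")), with the address replaced by the floating IP of your kubemaster.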

14.3 Deploying a Docker Swarm Cluster on Fedora Atomic

14.3.1 Prerequisites

These steps assume the following have been completed:

  • The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.

  • Deploying a Docker Swarm Cluster on Fedora Atomic requires the Fedora Atomic image fedora-atomic-26-20170723.0.x86_64.qcow2 prepared specifically for the OpenStack Pike release. You can download the fedora-atomic-26-20170723.0.x86_64.qcow2 image from https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-26-20170723.0/CloudImages/x86_64/

14.3.2 Creating the Cluster

The following example is created using the Docker Swarm Container Orchestration Engine (COE) running on Fedora Atomic guest OS on SUSE OpenStack Cloud VMs.

  1. As the stack user, log in to the Cloud Lifecycle Manager.

  2. Source the OpenStack admin credentials.

    $ source service.osrc
  3. If you have not already done so, download the Fedora Atomic image prepared for the OpenStack Pike release.

    $ wget https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-26-20170723.0/CloudImages/x86_64/images/Fedora-Atomic-26-20170723.0.x86_64.qcow2
  4. Create a glance image.

    $ openstack image create --name fedora-atomic-26-20170723.0.x86_64 --visibility public \
      --disk-format qcow2 --os-distro fedora-atomic --container-format bare \
      --file Fedora-Atomic-26-20170723.0.x86_64.qcow2 --progress
    [=============================>] 100%
    +------------------+--------------------------------------+
    | Property         | Value                                |
    +------------------+--------------------------------------+
    | checksum         | 9d233b8e7fbb7ea93f20cc839beb09ab     |
    | container_format | bare                                 |
    | created_at       | 2017-04-10T21:13:48Z                 |
    | disk_format      | qcow2                                |
    | id               | 4277115a-f254-46c0-9fb0-fffc45d2fd38 |
    | min_disk         | 0                                    |
    | min_ram          | 0                                    |
    | name             | fedora-atomic-26-20170723.0.x86_64   |
    | os_distro        | fedora-atomic                        |
    | owner            | 2f5b83ab49d54aaea4b39f5082301d09     |
    | protected        | False                                |
    | size             | 515112960                            |
    | status           | active                               |
    | tags             | []                                   |
    | updated_at       | 2017-04-10T21:13:56Z                 |
    | virtual_size     | None                                 |
    | visibility       | public                               |
    +------------------+--------------------------------------+
  5. Create a nova keypair.

    $ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    $ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
  6. Create a Magnum cluster template.

    Note

    The --tls-disabled flag is not specified in the included template. Authentication via client certificate will be turned on in clusters created from this template.

    $ magnum cluster-template-create --name my-swarm-template \
      --image-id 4277115a-f254-46c0-9fb0-fffc45d2fd38 \
      --keypair-id testkey \
      --external-network-id ext-net \
      --dns-nameserver 8.8.8.8 \
      --flavor-id m1.small \
      --docker-volume-size 5 \
      --network-driver docker \
      --coe swarm \
      --http-proxy http://proxy.yourcompany.net:8080/ \
      --https-proxy http://proxy.yourcompany.net:8080/
    Note
    1. Use the id value from the output of the openstack image create command in step 4.

    2. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should be able to resolve that hostname.

    3. The proxy is only needed if the public internet (for example, https://discovery.etcd.io/ or https://gcr.io/) is not accessible without a proxy.

  7. Create the cluster. The command below creates a minimal cluster consisting of a single Swarm master and a single Swarm node (worker).

    $ magnum cluster-create --name my-swarm-cluster --cluster-template my-swarm-template \
      --node-count 1 --master-count 1
  8. Immediately after you issue the cluster-create command, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.

    $ magnum cluster-show my-swarm-cluster
    +---------------------+------------------------------------------------------------+
    | Property            | Value                                                      |
    +---------------------+------------------------------------------------------------+
    | status              | CREATE_IN_PROGRESS                                         |
    | cluster_template_id | 17df266e-f8e1-4056-bdee-71cf3b1483e3                       |
    | uuid                | c3e13e5b-85c7-44f4-839f-43878fe5f1f8                       |
    | stack_id            | 3265d843-3677-4fed-bbb7-e0f56c27905a                       |
    | status_reason       | -                                                          |
    | created_at          | 2017-04-21T17:13:08+00:00                                  |
    | name                | my-swarm-cluster                                           |
    | updated_at          | -                                                          |
    | discovery_url       | https://discovery.etcd.io/54e83ea168313b0c2109d0f66cd0aa6f |
    | api_address         | -                                                          |
    | coe_version         | -                                                          |
    | master_addresses    | []                                                         |
    | create_timeout      | 60                                                         |
    | node_addresses      | []                                                         |
    | master_count        | 1                                                          |
    | container_version   | -                                                          |
    | node_count          | 1                                                          |
    +---------------------+------------------------------------------------------------+
  9. You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:

    $ heat resource-list -n2 3265d843-3677-4fed-bbb7-e0f56c27905a
    WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
    +--------------------+--------------------------------------+--------------------------------------------+-----------------+----------------------+-------------------------------+
    | resource_name      | physical_resource_id                 | resource_type                              | resource_status | updated_time         | stack_name                    |
    +--------------------+--------------------------------------+--------------------------------------------+-----------------+----------------------+-------------------------------+
    | api_address_switch | 430f82f2-03e3-4085-8c07-b4a6b6d7e261 | Magnum::ApiGatewaySwitcher                 | CREATE_COMPLETE | 2017-04-21T17:13:07Z | my-swarm-cluster-j7gbjcxaremy |
    . . .
  10. The cluster is complete when all resources show CREATE_COMPLETE. You can also obtain the floating IP address once the cluster has been created.

    $ magnum cluster-show my-swarm-cluster
    +---------------------+------------------------------------------------------------+
    | Property            | Value                                                      |
    +---------------------+------------------------------------------------------------+
    | status              | CREATE_COMPLETE                                            |
    | cluster_template_id | 17df266e-f8e1-4056-bdee-71cf3b1483e3                       |
    | uuid                | c3e13e5b-85c7-44f4-839f-43878fe5f1f8                       |
    | stack_id            | 3265d843-3677-4fed-bbb7-e0f56c27905a                       |
    | status_reason       | Stack CREATE completed successfully                        |
    | created_at          | 2017-04-21T17:13:08+00:00                                  |
    | name                | my-swarm-cluster                                           |
    | updated_at          | 2017-04-21T17:18:26+00:00                                  |
    | discovery_url       | https://discovery.etcd.io/54e83ea168313b0c2109d0f66cd0aa6f |
    | api_address         | tcp://172.31.0.7:2376                                      |
    | coe_version         | 1.0.0                                                      |
    | master_addresses    | ['172.31.0.7']                                             |
    | create_timeout      | 60                                                         |
    | node_addresses      | ['172.31.0.5']                                             |
    | master_count        | 1                                                          |
    | container_version   | 1.9.1                                                      |
    | node_count          | 1                                                          |
    +---------------------+------------------------------------------------------------+
  11. Generate and sign the client certificate using the magnum cluster-config command.

    $ mkdir my_swarm_cluster
    $ cd my_swarm_cluster/
    ~/my_swarm_cluster $ magnum cluster-config my-swarm-cluster
    {'tls': True, 'cfg_dir': '.', 'docker_host': u'tcp://172.31.0.7:2376'}
    ~/my_swarm_cluster $ ls
    ca.pem  cert.pem  key.pem
  12. Copy the generated certificates and key to the ~/.docker folder on the first cluster master node.

    $ scp -r ~/my_swarm_cluster fedora@172.31.0.7:.docker
    ca.pem                                             100% 1066     1.0KB/s   00:00
    key.pem                                            100% 1679     1.6KB/s   00:00
    cert.pem                                           100% 1005     1.0KB/s   00:00
  13. Log in to the first master node and set up the cluster access environment variables.

    $ ssh fedora@172.31.0.7
    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ export DOCKER_TLS_VERIFY=1
    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ export DOCKER_HOST=tcp://172.31.0.7:2376
  14. Verify that the swarm container is up and running.

    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ docker ps -a
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
    fcbfab53148c        swarm:1.0.0         "/swarm join --addr 1"   24 minutes ago      Up 24 minutes       2375/tcp            my-xggjts5zbgr-0-d4qhxhdujh4q-swarm-node-vieanhwdonon.novalocal/swarm-agent
  15. Deploy a sample Docker application (nginx) and verify that nginx is serving requests on port 8080 of the worker node(s), on both the floating and private IP addresses:

    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ docker run -itd -p 8080:80 nginx
    192030325fef0450b7b917af38da986edd48ac5a6d9ecb1e077b017883d18802
    
    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ docker port 192030325fef
    80/tcp -> 10.0.0.11:8080
    
    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ curl http://10.0.0.11:8080
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    <style>
    ...
    [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ curl http://172.31.0.5:8080
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    <style>
    ...
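
The property values used in steps 10 to 13, such as status and api_address, can also be read from the magnum cluster-show table by a script instead of being copied by hand. The following is a minimal sketch and not part of the product: table_field is a hypothetical helper name, and the function assumes the two-column "| Property | Value |" layout shown in the output above.

```shell
# table_field (hypothetical helper): print the Value column of the row whose
# Property column equals the first argument. The table is read from stdin.
# Assumes the two-column "| Property | Value |" layout shown above.
table_field() {
  awk -F'|' -v key="$1" '
    NF >= 3 {
      name = $2
      value = $3
      # Trim the padding that the table renderer adds around each cell.
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", value)
      if (name == key) { print value; exit }
    }'
}
```

For example, export DOCKER_HOST=$(magnum cluster-show my-swarm-cluster | table_field api_address) would set the variable used in step 13, assuming the cluster has already reached CREATE_COMPLETE. The function prints nothing if the requested property is not in the table.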

14.4 Deploying an Apache Mesos Cluster on Ubuntu

14.4.1 Prerequisites

These steps assume the following have been completed:

  • The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.

  • Deploying an Apache Mesos Cluster requires an Ubuntu Mesos image that is compatible with the OpenStack release. You can download the ubuntu-mesos-latest.qcow2 image from https://fedorapeople.org/groups/magnum/

14.4.2 Creating the Cluster

The following example creates a cluster using the Apache Mesos Container Orchestration Engine (COE) running on the Ubuntu guest OS on SUSE OpenStack Cloud VMs.

  1. As the stack user, log in to the Cloud Lifecycle Manager.

  2. Source the OpenStack admin credentials.

    $ source service.osrc
  3. If you have not already done so, download the Ubuntu Mesos image that is compatible with the OpenStack release.

    Note

    The https_proxy is only needed if your environment requires a proxy.

    $ https_proxy=http://proxy.yourcompany.net:8080 wget https://fedorapeople.org/groups/magnum/ubuntu-mesos-latest.qcow2
  4. Create a glance image.

    $ openstack image create --name ubuntu-mesos-latest --visibility public --disk-format qcow2 --os-distro ubuntu --container-format bare --file ubuntu-mesos-latest.qcow2 --progress
    [=============================>] 100%
    +------------------+--------------------------------------+
    | Property         | Value                                |
    +------------------+--------------------------------------+
    | checksum         | 97cc1fdb9ca80bf80dbd6842aab7dab5     |
    | container_format | bare                                 |
    | created_at       | 2017-04-21T19:40:20Z                 |
    | disk_format      | qcow2                                |
    | id               | d6a4e6f9-9e34-4816-99fe-227e0131244f |
    | min_disk         | 0                                    |
    | min_ram          | 0                                    |
    | name             | ubuntu-mesos-latest                  |
    | os_distro        | ubuntu                               |
    | owner            | 2f5b83ab49d54aaea4b39f5082301d09     |
    | protected        | False                                |
    | size             | 753616384                            |
    | status           | active                               |
    | tags             | []                                   |
    | updated_at       | 2017-04-21T19:40:32Z                 |
    | virtual_size     | None                                 |
    | visibility       | public                               |
    +------------------+--------------------------------------+
  5. Create a nova keypair.

    $ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    $ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
  6. Create a Magnum cluster template.

    $ magnum cluster-template-create --name my-mesos-template \
      --image-id d6a4e6f9-9e34-4816-99fe-227e0131244f \
      --keypair-id testkey \
      --external-network-id ext-net \
      --dns-nameserver 8.8.8.8 \
      --flavor-id m1.small \
      --docker-volume-size 5 \
      --network-driver docker \
      --coe mesos \
      --http-proxy http://proxy.yourcompany.net:8080/ \
      --https-proxy http://proxy.yourcompany.net:8080/
    Note
    1. Use the id value from the output of the openstack image create command in step 4.

    2. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should be able to resolve that hostname.

    3. The proxy is only needed if the public internet (for example, https://discovery.etcd.io/ or https://gcr.io/) is not accessible without a proxy.

  7. Create the cluster. The command below creates a minimal cluster consisting of a single Mesos master and a single Mesos worker (slave) node.

    $ magnum cluster-create --name my-mesos-cluster --cluster-template my-mesos-template --node-count 1 --master-count 1
  8. Immediately after you issue the cluster-create command, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.

    $ magnum cluster-show my-mesos-cluster
    +---------------------+--------------------------------------+
    | Property            | Value                                |
    +---------------------+--------------------------------------+
    | status              | CREATE_IN_PROGRESS                   |
    | cluster_template_id | be354919-fa6c-4db8-9fd1-69792040f095 |
    | uuid                | b1493402-8571-4683-b81e-ddc129ff8937 |
    | stack_id            | 50aa20a6-bf29-4663-9181-cf7ba3070a25 |
    | status_reason       | -                                    |
    | created_at          | 2017-04-21T19:50:34+00:00            |
    | name                | my-mesos-cluster                     |
    | updated_at          | -                                    |
    | discovery_url       | -                                    |
    | api_address         | -                                    |
    | coe_version         | -                                    |
    | master_addresses    | []                                   |
    | create_timeout      | 60                                   |
    | node_addresses      | []                                   |
    | master_count        | 1                                    |
    | container_version   | -                                    |
    | node_count          | 1                                    |
    +---------------------+--------------------------------------+
  9. You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:

    $ heat resource-list -n2 50aa20a6-bf29-4663-9181-cf7ba3070a25
    WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
    +------------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+-------------------------------+
    | resource_name                | physical_resource_id                 | resource_type                     | resource_status | updated_time         | stack_name                    |
    +------------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+-------------------------------+
    | add_proxy_master             | 10394a74-1503-44b4-969a-44258c9a7be1 | OS::heat::SoftwareConfig          | CREATE_COMPLETE | 2017-04-21T19:50:33Z | my-mesos-cluster-w2trq7m46qus |
    | add_proxy_master_deployment  |                                      | OS::heat::SoftwareDeploymentGroup | INIT_COMPLETE   | 2017-04-21T19:50:33Z | my-mesos-cluster-w2trq7m46qus |
    ...
  10. The cluster is complete when all resources show CREATE_COMPLETE.

    $ magnum cluster-show my-mesos-cluster
    +---------------------+--------------------------------------+
    | Property            | Value                                |
    +---------------------+--------------------------------------+
    | status              | CREATE_COMPLETE                      |
    | cluster_template_id | 9e942bfa-2c78-4837-82f5-6bea88ba1bf9 |
    | uuid                | 9d7bb502-8865-4cbd-96fa-3cd75f0f6945 |
    | stack_id            | 339a72b4-a131-47c6-8d10-365e6f6a18cf |
    | status_reason       | Stack CREATE completed successfully  |
    | created_at          | 2017-04-24T20:54:31+00:00            |
    | name                | my-mesos-cluster                     |
    | updated_at          | 2017-04-24T20:59:18+00:00            |
    | discovery_url       | -                                    |
    | api_address         | 172.31.0.10                          |
    | coe_version         | -                                    |
    | master_addresses    | ['172.31.0.10']                      |
    | create_timeout      | 60                                   |
    | node_addresses      | ['172.31.0.5']                       |
    | master_count        | 1                                    |
    | container_version   | 1.9.1                                |
    | node_count          | 1                                    |
    +---------------------+--------------------------------------+
  11. Verify that the Marathon web console is available at http://${MASTER_IP}:8080/ and that the Mesos UI is available at http://${MASTER_IP}:5050/, where MASTER_IP is the address listed under master_addresses. Each command below prints the HTTP status code returned by the service; a 2xx or 3xx code indicates that the service is answering.

    $ MASTER_IP=172.31.0.10
    $ curl -s -o /dev/null -w "%{http_code}\n" http://${MASTER_IP}:8080/
    $ curl -s -o /dev/null -w "%{http_code}\n" http://${MASTER_IP}:5050/
  12. Create an example Mesos application.

    $ mkdir my_mesos_cluster
    $ cd my_mesos_cluster/
    $ cat > sample.json <<-EOF
    {
      "id": "sample",
      "cmd": "python3 -m http.server 8080",
      "cpus": 0.5,
      "mem": 32.0,
      "container": {
        "type": "DOCKER",
        "docker": {
          "image": "python:3",
          "network": "BRIDGE",
          "portMappings": [
            { "containerPort": 8080, "hostPort": 0 }
          ]
        }
      }
    }
    EOF
    $ curl -s -X POST -H "Content-Type: application/json" \
      http://172.31.0.10:8080/v2/apps -d@sample.json | json_pp
    {
       "dependencies" : [],
       "healthChecks" : [],
       "user" : null,
       "mem" : 32,
       "requirePorts" : false,
       "tasks" : [],
       "cpus" : 0.5,
       "upgradeStrategy" : {
          "minimumHealthCapacity" : 1,
          "maximumOverCapacity" : 1
       },
       "maxLaunchDelaySeconds" : 3600,
       "disk" : 0,
       "constraints" : [],
       "executor" : "",
       "cmd" : "python3 -m http.server 8080",
       "id" : "/sample",
       "labels" : {},
       "ports" : [
          0
       ],
       "storeUrls" : [],
       "instances" : 1,
       "tasksRunning" : 0,
       "tasksHealthy" : 0,
       "acceptedResourceRoles" : null,
       "env" : {},
       "tasksStaged" : 0,
       "tasksUnhealthy" : 0,
       "backoffFactor" : 1.15,
       "version" : "2017-04-25T16:37:40.657Z",
       "uris" : [],
       "args" : null,
       "container" : {
          "volumes" : [],
          "docker" : {
             "portMappings" : [
                {
                   "containerPort" : 8080,
                   "hostPort" : 0,
                   "servicePort" : 0,
                   "protocol" : "tcp"
                }
             ],
             "parameters" : [],
             "image" : "python:3",
             "forcePullImage" : false,
             "network" : "BRIDGE",
             "privileged" : false
          },
          "type" : "DOCKER"
       },
       "deployments" : [
          {
             "id" : "6fbe48f0-6a3c-44b7-922e-b172bcae1be8"
          }
       ],
       "backoffSeconds" : 1
    }
  13. Wait for the sample application to start. Use the REST API or the Marathon web console to monitor its status:

    $ curl -s http://172.31.0.10:8080/v2/apps/sample | json_pp
    {
       "app" : {
          "deployments" : [],
          "instances" : 1,
          "tasks" : [
             {
                "id" : "sample.7fdd1ee4-29d5-11e7-9ee0-02427da4ced1",
                "stagedAt" : "2017-04-25T16:37:40.807Z",
                "version" : "2017-04-25T16:37:40.657Z",
                "ports" : [
                   31827
                ],
                "appId" : "/sample",
                "slaveId" : "21444bc5-3eb8-49cd-b020-77041e0c88d0-S0",
                "host" : "10.0.0.9",
                "startedAt" : "2017-04-25T16:37:42.003Z"
             }
          ],
          "upgradeStrategy" : {
             "maximumOverCapacity" : 1,
             "minimumHealthCapacity" : 1
          },
          "storeUrls" : [],
          "requirePorts" : false,
          "user" : null,
          "id" : "/sample",
          "acceptedResourceRoles" : null,
          "tasksRunning" : 1,
          "cpus" : 0.5,
          "executor" : "",
          "dependencies" : [],
          "args" : null,
          "backoffFactor" : 1.15,
          "ports" : [
             10000
          ],
          "version" : "2017-04-25T16:37:40.657Z",
          "container" : {
             "volumes" : [],
             "docker" : {
                "portMappings" : [
                   {
                      "servicePort" : 10000,
                      "protocol" : "tcp",
                      "hostPort" : 0,
                      "containerPort" : 8080
                   }
                ],
                "forcePullImage" : false,
                "parameters" : [],
                "image" : "python:3",
                "privileged" : false,
                "network" : "BRIDGE"
             },
             "type" : "DOCKER"
          },
          "constraints" : [],
          "tasksStaged" : 0,
          "env" : {},
          "mem" : 32,
          "disk" : 0,
          "labels" : {},
          "tasksHealthy" : 0,
          "healthChecks" : [],
          "cmd" : "python3 -m http.server 8080",
          "backoffSeconds" : 1,
          "maxLaunchDelaySeconds" : 3600,
          "versionInfo" : {
             "lastConfigChangeAt" : "2017-04-25T16:37:40.657Z",
             "lastScalingAt" : "2017-04-25T16:37:40.657Z"
          },
          "uris" : [],
          "tasksUnhealthy" : 0
       }
    }
  14. Verify that the deployed application is responding on the automatically assigned port (31827 in the tasks section of the output above) on the floating IP address of the worker node.

    $ curl http://172.31.0.5:31827
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Directory listing for /</title>
    ...
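
The check in step 13 can be scripted by reading the tasksRunning counter from the Marathon response. The following is a minimal sketch under stated assumptions: tasks_running is a hypothetical helper name, and it uses a plain text match rather than a JSON parser, so it relies on the tasksRunning key appearing once in the response, as it does in the output above.

```shell
# tasks_running (hypothetical helper): print the value of the "tasksRunning"
# counter found in Marathon app JSON read from stdin. Works on compact and on
# json_pp-formatted output. This is a text match, not a JSON parser.
tasks_running() {
  sed -n 's/.*"tasksRunning"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
    head -n 1
}
```

For example, curl -s http://172.31.0.10:8080/v2/apps/sample | tasks_running prints the number of running tasks, which can be compared with the instances value requested for the application. The function prints nothing if the key is missing.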

14.5 Creating a Magnum Cluster with the Dashboard

You can alternatively create a cluster template and cluster with the Magnum UI in horizon. The example instructions below demonstrate how to deploy a Kubernetes Cluster using the Fedora Atomic image. Other deployments such as Kubernetes on CoreOS, Docker Swarm on Fedora, and Mesos on Ubuntu all follow the same set of instructions mentioned below with slight variations to their parameters. You can determine those parameters by looking at the previous set of CLI instructions in the magnum cluster-template-create and magnum cluster-create commands.

14.5.1 Prerequisites

  • Magnum must be installed before proceeding. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.

    Important

    Pay particular attention to external-name: in data/network_groups.yml. This cannot be set to the default myardana.test and must be a valid DNS-resolvable FQDN. If you do not have a DNS-resolvable FQDN, remove or comment out the external-name entry and the public endpoint will use an IP address instead of a name.

  • The image on which you want to base your cluster must already have been uploaded to glance. See the previous CLI instructions for deploying a cluster to learn how this is done.

14.5.2 Creating the Cluster Template

You will need access to the Dashboard to create the cluster template.

  1. Open a web browser that has both JavaScript and cookies enabled. In the address bar, enter the host name or IP address for the dashboard.

  2. On the Log In page, enter your user name and password and then click Connect.

  3. Make sure you are in the appropriate domain and project in the left pane. Below is an example image of the drop-down box:

    Image
  4. A key pair is required for cluster template creation. It is applied to VMs created during the cluster creation process. This allows SSH access to your cluster's VMs. If you would like to create a new key pair, do so by going to Project › Compute › Access & Security › Key Pairs.

  5. Go to Project › Container Infra › Cluster Templates. Insert CLUSTER_NAME and click on + Create Cluster Template with the following options:

    Image
    • Public - makes the template available for others to use.

    • Enable Registry - creates and uses a private docker registry backed by OpenStack swift in addition to using the public docker registry.

    • Disable TLS - turns off TLS encryption. For Kubernetes clusters which use client certificate authentication, disabling TLS also involves disabling authentication.

    Image
    Image
    • Proxies are only needed if the created VMs require a proxy to connect externally.

    • Master LB – This should be turned off; LbaaS v2 (Octavia) is not available in SUSE OpenStack Cloud.

    • Floating IP – This assigns floating IPs to the cluster nodes when the cluster is being created. Select this option if you want to SSH into the cluster nodes to perform diagnostics or additional Kubernetes tuning.

  6. Click the Submit button to create the cluster template and you should see my-template in the list of templates.

14.5.3 Creating the Cluster

  1. Click Create Cluster for my-template or go to Project › Container Infra › Clusters and click + Create Cluster with the following options.

    Image
  2. Click Create to start the cluster creation process.

  3. Click Clusters in the left pane to see the list of clusters. You will see my-cluster in this list. If you select my-cluster, you will see additional information regarding your cluster.

    Image

15 System Maintenance

This section contains the following subsections to help you manage, configure, and maintain your SUSE OpenStack Cloud cloud as well as procedures for performing node maintenance.

15.1 Planned System Maintenance

Planned maintenance tasks for your cloud. See the sections below for details.

15.1.1 Whole Cloud Maintenance

Planned maintenance procedures for your whole cloud.

15.1.1.1 Bringing Down Your Cloud: Services Down Method

Important

If you have a planned maintenance and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.

If you prefer a rolling reboot, in which your cloud services continue running, see Section 15.1.1.2, “Rolling Reboot of the Cloud”.

To back up your cloud before performing these steps, first see Chapter 17, Backup and Restore.

15.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment

Perform the following steps from your Cloud Lifecycle Manager.

  1. Log in to your Cloud Lifecycle Manager.

  2. Gracefully shut down your cloud by running the ardana-stop.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-stop.yml
  3. Shut down and restart your nodes. There are multiple ways you can do this:

    1. You can SSH to each node and use sudo reboot -f to reboot the node. Reboot the control plane nodes first so that they become functional as early as possible.

    2. You can shut down the nodes and then physically restart them either via a power button or the IPMI. If your cloud data model servers.yml specifies iLO connectivity for all nodes, then you can use the bm-power-down.yml and bm-power-up.yml playbooks on the Cloud Lifecycle Manager.

      Power down the control plane nodes last so that they remain online as long as possible, and power them back up before other nodes to restore their services quickly.

  4. Perform the necessary maintenance.

  5. After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.

  6. Determine the current power status of the nodes in your environment:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts bm-power-status.yml
  7. If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
    Note

    Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.

  8. Bring the databases back up:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  9. Gracefully bring up your cloud services by running the ardana-start.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-start.yml
  10. Pause for a few minutes and give the cloud environment time to come up completely and then verify the status of the individual services using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml
  11. If any services did not start properly, you can run playbooks for the specific services having issues.

    For example:

    If RabbitMQ fails, run:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml

    You can check the status of RabbitMQ afterwards with this:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml

    If the recovery fails, you can run:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml

    Each of the other services has a playbook in the ~/scratch/ansible/next/ardana/ansible directory, named <service>-start.yml, that you can run. For example, the playbook for the compute service is nova-start.yml.

  12. Continue checking the status of your SUSE OpenStack Cloud 9 cloud services until there are no more failed or unreachable nodes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml
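
The repeated status check in the last step can be automated. The following is a minimal sketch, not part of the product: the variables ANSIBLE_DIR, PLAYBOOK_CMD, MAX_TRIES, and SLEEP_SECONDS and the function name wait_for_cloud_status are names introduced here for illustration only. It re-runs ardana-status.yml until the playbook exits successfully or the retry budget is used up.

```shell
# Illustrative defaults (assumptions, adjust to your environment).
ANSIBLE_DIR=${ANSIBLE_DIR:-$HOME/scratch/ansible/next/ardana/ansible}
PLAYBOOK_CMD=${PLAYBOOK_CMD:-ansible-playbook}
MAX_TRIES=${MAX_TRIES:-10}
SLEEP_SECONDS=${SLEEP_SECONDS:-120}

# Re-run ardana-status.yml until it succeeds or MAX_TRIES is reached.
wait_for_cloud_status() {
    try=1
    while :; do
        # Subshell keeps the caller's working directory unchanged.
        if (cd "$ANSIBLE_DIR" && $PLAYBOOK_CMD -i hosts/verb_hosts ardana-status.yml); then
            echo "ardana-status.yml passed on attempt $try"
            return 0
        fi
        if [ "$try" -ge "$MAX_TRIES" ]; then
            echo "ardana-status.yml still failing after $try attempts" >&2
            return 1
        fi
        try=$((try + 1))
        sleep "$SLEEP_SECONDS"
    done
}
```

After sourcing the snippet on the Cloud Lifecycle Manager, call wait_for_cloud_status. If it gives up, inspect the playbook output and run the <service>-start.yml playbook for the affected service, as described in step 11.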

15.1.1.2 Rolling Reboot of the Cloud

If you have planned maintenance and need to reboot every node in your cloud while minimizing downtime, follow the steps here to restart your cloud safely. If downtime of your services is acceptable, another option for planned maintenance can be found at Section 15.1.1.1, “Bringing Down Your Cloud: Services Down Method”.

15.1.1.2.1 Recommended node reboot order

To ensure that rebooted nodes reintegrate into the cluster, the key is having enough time between controller reboots.

The recommended way to achieve this is as follows:

  1. Reboot controller nodes one-by-one with a suitable interval in between. If you alternate between controllers and compute nodes you will gain more time between the controller reboots.

  2. Reboot compute nodes (if present in your cloud).

  3. Reboot swift nodes (if present in your cloud).

  4. Reboot ESX nodes (if present in your cloud).

15.1.1.2.2 Rebooting controller nodes

Turn off the keystone Fernet Token-Signing Key Rotation

Before rebooting any controller node, you need to ensure that the keystone Fernet token-signing key rotation is turned off. Run the following command:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml

Migrate singleton services first

Note

If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:

ardana > sudo systemctl start apache2

The first consideration before rebooting any controller node is that a few services run as singletons (non-HA), so they are unavailable while the controller they run on is down. Typically this is a very small window, but if you want to retain the service during the reboot of that server, take special action to maintain it, such as migrating the service.

For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.

For the cinder-volume singleton service:

Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:

ardana > ps auxww | grep cinder-volume | grep -v grep

Run the cinder-migrate-volume.yml playbook. Details about the cinder volume and backup migration instructions can be found in Section 8.1.3, “Managing cinder Volume and Backup Services”.
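
To avoid logging in to each controller by hand, the check can be wrapped in a small loop. This is a sketch under assumptions: SSH_CMD and find_cinder_volume_hosts are names introduced here, and the controller host names you pass in must be reachable over SSH from the Cloud Lifecycle Manager. It runs the same ps pipeline shown above on every host and prints the hosts where a cinder-volume process was found.

```shell
# SSH_CMD is an illustrative variable, not part of the product.
SSH_CMD=${SSH_CMD:-ssh}

# Print each host (given as arguments) that runs a cinder-volume process.
find_cinder_volume_hosts() {
    for host in "$@"; do
        # The remote pipeline exits 0 only if a matching process line remains.
        if $SSH_CMD "$host" "ps auxww | grep cinder-volume | grep -v grep" >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}
```

For example, find_cinder_volume_hosts <controller1> <controller2> <controller3> should print exactly one host name. More than one line of output means the singleton is running in more than one place and should be investigated before rebooting.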

For the SNAT namespace singleton service:

If you reboot the controller node that hosts the SNAT namespace service, Compute instances without floating IPs lose network connectivity while that controller is down. To prevent this, use the following steps to determine which controller node hosts the SNAT namespace service and migrate it to one of the other controller nodes before the reboot.

  1. Locate the SNAT node where the router is providing the active snat_service:

    1. From the Cloud Lifecycle Manager, list out your ports to determine which port is serving as the router gateway:

      ardana > source ~/service.osrc
      ardana > openstack port list --device-owner network:router_gateway

      Example:

      $ openstack port list --device-owner network:router_gateway
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
      | id                                   | name | mac_address       | fixed_ips                                                                           |
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
      | 287746e6-7d82-4b2c-914c-191954eba342 |      | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} |
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
    2. Look at the details of this port to determine the binding:host_id value, which identifies the host to which the port is bound:

      ardana > openstack port show <port_id>

      Example, where the value you need is in the binding:host_id row:

      ardana > openstack port show 287746e6-7d82-4b2c-914c-191954eba342
      +-----------------------+--------------------------------------------------------------------------------------------------------------+
      | Field                 | Value                                                                                                        |
      +-----------------------+--------------------------------------------------------------------------------------------------------------+
      | admin_state_up        | True                                                                                                         |
      | allowed_address_pairs |                                                                                                              |
      | binding:host_id       | ardana-cp1-c1-m2-mgmt                                                                                        |
      | binding:profile       | {}                                                                                                           |
      | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
      | binding:vif_type      | ovs                                                                                                          |
      | binding:vnic_type     | normal                                                                                                       |
      | device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
      | device_owner          | network:router_gateway                                                                                       |
      | dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
      | dns_name              |                                                                                                              |
      | extra_dhcp_opts       |                                                                                                              |
      | fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
      | id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
      | mac_address           | fa:16:3e:2e:26:ac                                                                                            |
      | name                  |                                                                                                              |
      | network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
      | security_groups       |                                                                                                              |
      | status                | DOWN                                                                                                         |
      | tenant_id             |                                                                                                              |
      +-----------------------+--------------------------------------------------------------------------------------------------------------+

      In this example, the ardana-cp1-c1-m2-mgmt is the node hosting the SNAT namespace service.

  2. SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:

    ardana > ssh <IP_of_SNAT_namespace_host>
    ardana > sudo ip netns exec snat-<router_ID> bash

    Example:

    ardana > sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
  3. Obtain the ID for the L3 Agent for the node hosting your SNAT namespace:

    ardana > source ~/service.osrc
    ardana > openstack network agent list

    Example, with the entry you need given the examples above:

    ardana > openstack network agent list
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
    | id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
    | 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent           | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent       | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-metadata-agent    |
    | 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent             | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent             | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-l3-agent          |
    | 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent   | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent             | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent       | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-metadata-agent    |
    | 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent       | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-metadata-agent    |
    | a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent             | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent           | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent   | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent   | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-openvswitch-agent |
    | e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent   | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent           | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent       | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-metadata-agent    |
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
  4. Also obtain the ID for the L3 Agent of the node you are going to move the SNAT namespace service to using the same commands as the previous step.

  5. Use these commands to move the SNAT namespace service. The <qrouter_uuid> argument is the router ID, the same value used in the snat-<router_id> namespace name:

    1. Remove the router from the L3 agent on the old host:

      ardana > openstack network agent remove router --agent-type l3 \
      <agent_id_of_snat_namespace_host> \
      <qrouter_uuid>

      Example:

      ardana > openstack network agent remove router --agent-type l3 \
      a209c67d-c00f-4a00-b31c-0db30e9ec661 \
      e122ea3f-90c5-4662-bf4a-3889f677aacf
      Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent
    2. Remove the SNAT namespace:

      ardana > sudo ip netns delete snat-<router_id>

      Example:

      ardana > sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf
    3. Add the router to the L3 agent on the new host:

      ardana > openstack network agent add router --agent-type l3 \
      <agent_id_of_new_snat_namespace_host> \
      <qrouter_uuid>

      Example:

      ardana > openstack network agent add router --agent-type l3 \
      3bc28451-c895-437b-999d-fdcff259b016 \
      e122ea3f-90c5-4662-bf4a-3889f677aacf
      Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent

    Confirm that it has been moved by listing the details of your port from step 1b above, noting the value of binding:host_id which should be updated to the host you moved your SNAT namespace to:

    ardana > openstack port show <port_ID>

    Example:

    ardana > openstack port show 287746e6-7d82-4b2c-914c-191954eba342
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
    | Field                 | Value                                                                                                        |
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
    | admin_state_up        | True                                                                                                         |
    | allowed_address_pairs |                                                                                                              |
    | binding:host_id       | ardana-cp1-c1-m1-mgmt                                                                                        |
    | binding:profile       | {}                                                                                                           |
    | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
    | binding:vif_type      | ovs                                                                                                          |
    | binding:vnic_type     | normal                                                                                                       |
    | device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
    | device_owner          | network:router_gateway                                                                                       |
    | dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
    | dns_name              |                                                                                                              |
    | extra_dhcp_opts       |                                                                                                              |
    | fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
    | id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
    | mac_address           | fa:16:3e:2e:26:ac                                                                                            |
    | name                  |                                                                                                              |
    | network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
    | security_groups       |                                                                                                              |
    | status                | DOWN                                                                                                         |
    | tenant_id             |                                                                                                              |
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
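
The two lookups in this procedure (the binding:host_id of the gateway port and the L3 agent ID for a given host) can be extracted from the table output with awk instead of being read off by eye. The sketch below is illustrative only: the function names binding_host_id and l3_agent_id_for_host are introduced here, and the parsing assumes the table layouts shown in the examples above.

```shell
# Read `openstack port show` table output on stdin; print the binding:host_id value.
binding_host_id() {
    awk -F'|' '$2 ~ /binding:host_id/ { gsub(/[ \t]/, "", $3); print $3 }'
}

# Read `openstack network agent list` table output on stdin; print the ID of
# the L3 agent whose host column equals the host name given as $1.
l3_agent_id_for_host() {
    awk -F'|' -v want="$1" '
        $3 ~ /L3 agent/ {
            id = $2; host = $4
            gsub(/[ \t]/, "", id); gsub(/[ \t]/, "", host)
            if (host == want) print id
        }'
}
```

For example, openstack port show <port_id> | binding_host_id prints the host name, and openstack network agent list | l3_agent_id_for_host <host_name> prints the agent ID to pass to the remove and add commands. Run the first pipeline again after the move to confirm that the host has changed.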

Reboot the controllers

In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services.

ardana > for i in $(grep -w cluster-prefix \
~/openstack/my_cloud/definition/data/control_plane.yml \
| awk '{print $2}'); do grep $i \
~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts \
| grep ansible_ssh_host | awk '{print $1}'; done

Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:

  1. If any singleton services are active on this node, they will be unavailable while the node is down. To retain a service during the reboot, migrate it to another controller first, as described above.

  2. Stop all services on the controller node that you are rebooting first:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \
    <controller node>
  3. Reboot the controller node, e.g. run the following command on the controller itself:

    ardana > sudo reboot

    Note that the current node being rebooted could be hosting the lifecycle manager.

  4. Wait for the controller node to become reachable over SSH, then allow at least five more minutes for it to settle. Start all services on the controller node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
    --limit <controller node>
  5. Verify that the status of all services on the controller node is OK:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml \
    --limit <controller node>
  6. When the above start operation has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off that node first.

Note

It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).
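
The per-controller sequence (stop services, reboot, wait, start services, check status) can be sketched as a single shell function. This is an illustration under stated assumptions, not a supported tool: the variables ANSIBLE_DIR, PLAYBOOK_CMD, SSH_CMD, SETTLE_SECONDS, and POLL_SECONDS and the functions run_playbook and reboot_controller are names introduced here, and the sketch assumes the node name is valid both for --limit and as an SSH target. It does not apply to a controller that also hosts the Cloud Lifecycle Manager, because the script itself would be interrupted by the reboot.

```shell
# Illustrative defaults (assumptions, adjust to your environment).
ANSIBLE_DIR=${ANSIBLE_DIR:-$HOME/scratch/ansible/next/ardana/ansible}
PLAYBOOK_CMD=${PLAYBOOK_CMD:-ansible-playbook}
SSH_CMD=${SSH_CMD:-ssh}
SETTLE_SECONDS=${SETTLE_SECONDS:-300}   # "at least five minutes" to settle
POLL_SECONDS=${POLL_SECONDS:-15}

# $1 = playbook name, $2 = node to limit the run to
run_playbook() {
    (cd "$ANSIBLE_DIR" && $PLAYBOOK_CMD -i hosts/verb_hosts "$1" --limit "$2")
}

# Reboot one controller and return non-zero if any playbook fails.
reboot_controller() {
    node=$1
    run_playbook ardana-stop.yml "$node" || return 1
    # The SSH session usually drops during reboot, so ignore its exit status.
    $SSH_CMD "$node" sudo reboot || true
    # Give the node time to go down before polling for it to come back.
    sleep "$POLL_SECONDS"
    until $SSH_CMD "$node" true 2>/dev/null; do
        sleep "$POLL_SECONDS"
    done
    # Extra settling time before services are started.
    sleep "$SETTLE_SECONDS"
    run_playbook ardana-start.yml "$node" || return 1
    run_playbook ardana-status.yml "$node"
}
```

Call reboot_controller <controller node> for one controller at a time and continue with the next controller only if the function returns success. Migrating singleton services remains a manual step before each call.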

Re-enable the keystone Fernet Token-Signing Key Rotation

After all the controller nodes have been rebooted successfully and are back online, re-enable the keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
15.1.1.2.3 Rebooting compute nodes

To reboot a compute node the following operations will need to be performed:

  • Disable provisioning of the node to take the node offline to prevent further instances being scheduled to the node during the reboot.

  • Identify instances that exist on the compute node, and then either:

    • Live migrate the instances off the node before rebooting it, or

    • Stop the instances.

  • Reboot the node

  • Restart the nova services

  1. Disable provisioning:

    ardana > openstack compute service set --disable --disable-reason "DESCRIBE REASON" compute nova-compute

    If the node has existing instances running on it, these instances must be migrated or stopped before rebooting the node.

  2. Identify the instances on the compute node. Note: The following command must be run with nova admin credentials.

    ardana > openstack server list --host <hostname> --all-tenants
  3. Migrate or Stop the instances on the compute node.

    Migrate the instances off the node by running one of the following commands for each of the instances:

    If your instance is booted from a volume and has any number of cinder volumes attached, use the nova live-migration command:

    ardana > nova live-migration <instance uuid> [<target compute host>]

    If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:

    ardana > nova live-migration --block-migrate <instance uuid> [<target compute host>]

    Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova scheduler chooses a node for you.

    OR

    Stop the instances on the node by running the following command for each of the instances:

    ardana > openstack server stop <instance-uuid>
  4. Stop all services on the Compute node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>
  5. SSH to your Compute nodes and reboot them:

    ardana > sudo reboot

    The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.

  6. If the node does not power back on by itself, use the bm-power-up.yml playbook from the Cloud Lifecycle Manager to restart it. Specify just the node(s) you want to start in the nodelist parameter, that is, nodelist=<node1>[,<node2>][,<node3>].

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
  7. Execute the ardana-start.yml playbook, specifying the node(s) you want to start with the --limit parameter. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
  8. Re-enable provisioning on the node:

    ardana > openstack compute service set --enable compute nova-compute
  9. Restart any instances you stopped.

    ardana > openstack server start <instance-uuid>
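
Steps 2 and 3 can be combined so that every instance on the host is handled in one pass. The following is a minimal sketch under assumptions: OPENSTACK_CMD, NOVA_CMD, and drain_compute_host are names introduced here, and the -f value -c ID output options are used to print one instance UUID per line. It covers the plain live-migration and stop cases only. Instances that need --block-migrate must be handled separately.

```shell
# Illustrative variables, not part of the product.
OPENSTACK_CMD=${OPENSTACK_CMD:-openstack}
NOVA_CMD=${NOVA_CMD:-nova}

# $1 = compute host name, $2 = "migrate" or "stop"
drain_compute_host() {
    host=$1
    action=$2
    case $action in
        migrate|stop) ;;
        *) echo "unknown action: $action (use migrate or stop)" >&2; return 1 ;;
    esac
    # List instance UUIDs on the host, one per line, and act on each.
    $OPENSTACK_CMD server list --host "$host" --all-tenants -f value -c ID |
    while read -r uuid; do
        if [ "$action" = migrate ]; then
            # No target host given, so the nova scheduler picks one.
            $NOVA_CMD live-migration "$uuid" </dev/null
        else
            $OPENSTACK_CMD server stop "$uuid" </dev/null
        fi
    done
}
```

Run drain_compute_host <hostname> migrate (or stop) with admin credentials loaded, after provisioning has been disabled on the node. Re-run the server list command afterwards to confirm that no active instances remain before stopping services and rebooting.
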
15.1.1.2.4 Rebooting swift nodes

If your swift services are on a controller node, follow the controller node reboot instructions above.

For a dedicated swift PAC cluster or swift Object resource node:

For each swift host:

  1. Stop all services on the swift node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <swift node>
  2. Reboot the swift node by running the following command on the swift node itself:

    ardana > sudo reboot
  3. Wait for the node to become ssh-able and then start all services on the swift node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <swift node>
15.1.1.2.5 Get list of status playbooks

The following command will display a list of status playbooks:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ls *status*
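
If you want to run every status playbook in turn and see a short summary, a loop such as the following can be used. It is a sketch only: ANSIBLE_DIR, PLAYBOOK_CMD, and run_all_status_playbooks are names introduced here, and the *-status.yml pattern is an assumption based on playbook names such as ardana-status.yml and rabbitmq-status.yml used in this guide.

```shell
# Illustrative defaults (assumptions, adjust to your environment).
ANSIBLE_DIR=${ANSIBLE_DIR:-$HOME/scratch/ansible/next/ardana/ansible}
PLAYBOOK_CMD=${PLAYBOOK_CMD:-ansible-playbook}

# Run each *-status.yml playbook and report which ones failed.
# The function body is a subshell, so the caller's directory is unchanged.
run_all_status_playbooks() (
    cd "$ANSIBLE_DIR" || exit 1
    failed=""
    for pb in *-status.yml; do
        [ -e "$pb" ] || continue        # pattern matched no files
        if ! $PLAYBOOK_CMD -i hosts/verb_hosts "$pb"; then
            failed="$failed $pb"
        fi
    done
    if [ -n "$failed" ]; then
        echo "failed:$failed"
        exit 1
    fi
    echo "all status playbooks passed"
)
```

The last line of output names the playbooks that failed, which indicates the <service>-start.yml playbooks to try next.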

15.1.2 Planned Control Plane Maintenance

Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.

15.1.2.1 Replacing a Controller Node

This section outlines steps for replacing a controller node in your environment.

For SUSE OpenStack Cloud, you must have three controller nodes. Therefore, adding or removing nodes is not an option. However, if you need to repair or replace a controller node, you may do so by following the steps outlined here. Note that to run any playbooks whatsoever for cloud maintenance, you will always run the steps from the Cloud Lifecycle Manager.

These steps will depend on whether you need to replace a shared lifecycle manager/controller node or whether this is a standalone controller node.

Keep in mind while performing the following tasks:

  • Do not add entries for a new server. Instead, update the entries for the broken one.

  • Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.

15.1.2.1.1 Replacing a Shared Cloud Lifecycle Manager/Controller Node

If the controller node you need to replace was also being used as your Cloud Lifecycle Manager, use the steps below. If this is not a shared controller, skip to the next section.

  1. To ensure that you use the same version of SUSE OpenStack Cloud that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the installation guide:

    Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.5.2 “Installing the SUSE OpenStack Cloud Extension”

  2. Initialize the Cloud Lifecycle Manager platform by running ardana-init.

  3. To restore your data, see Section 15.2.3.2.3, “Point-in-time Cloud Lifecycle Manager Recovery”. At this time, restore only the backup of ardana files on the system into /var/lib/ardana (the user's home directory).

  4. On the new node, update your cloud model with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields to reflect the attributes of the node. Do not change the id, ip-addr, role, or server-group settings.

    Note

    When imaging servers with your own tooling, it is still necessary to have ILO/IPMI settings for all nodes. Even if you are not using Cobbler, the username and password fields in servers.yml need to be filled in with dummy settings. For example, add the following to servers.yml:

    ilo-user: manual
    ilo-password: deployment
  5. Open the servers.yml file describing your cloud nodes:

    ardana > git -C ~/openstack checkout site
    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > vi servers.yml

    Change, as necessary, the mac-addr, ilo-ip, ilo-password, and ilo-user fields of the existing controller node. Save and commit the change:

    ardana > git commit -a -m "repaired node X"
  6. Run the configuration processor and ready-deployment playbooks as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the wipe_disks.yml playbook to ensure all non-OS partitions on the new node are completely wiped prior to continuing with the installation. (The value to be used for hostname is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.)

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit deployer_node_name

    The value for deployer_node_name should be the name identifying the deployer/controller being initialized as it is represented in the hosts/verb_hosts file.

  8. Deploy Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Refer again to Section 15.2.3.2.3, “Point-in-time Cloud Lifecycle Manager Recovery” and proceed to restore all remaining backups, with the exclusion of /var/lib/ardana (which was done earlier) and the cobbler content in /var/lib/cobbler and /srv/www/cobbler.

  10. Install the software on your new Cloud Lifecycle Manager/controller node with these playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml \
    -e rebuild=True --limit deployer_node_name
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml \
    -e rebuild=True --limit deployer_node_name,localhost
    ardana > ansible-playbook -i hosts/verb_hosts tempest-deploy.yml
    Important

    If you receive the message stderr: Error: mnesia_not_running when running the ardana-deploy.yml playbook, it is likely due to one of the following conditions:

    • RabbitMQ was not running on the clustered node

    • The old node was not removed from the cluster

    Correct this problem with the following steps:

    1. Of the remaining clustered nodes (M2 and M3), M2 is the new master. Make sure the application has started and M1 is no longer a member. On the M2 node, run:

      ardana > sudo rabbitmqctl start_app
      ardana > sudo rabbitmqctl forget_cluster_node rabbit@M1

      Check that M1 is no longer a member of the cluster.

      ardana > sudo rabbitmqctl cluster_status
    2. On the newly installed node, M1, make sure RabbitMQ has stopped. On M1, run:

      ardana > sudo rabbitmqctl stop_app
    3. Re-run the ardana-deploy.yml playbook as before.

  11. While the node is being replaced, alarms will be raised. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
15.1.2.1.2 Replacing a Standalone Controller Node

If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.

  1. Log in to the Cloud Lifecycle Manager.

  2. Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.

  3. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Remove the old controller node(s) from Cobbler. You can list out the systems in Cobbler currently with this command:

    ardana > sudo cobbler system list

    and then remove the old controller nodes with this command:

    ardana > sudo cobbler system remove --name <node>
  7. Remove the SSH key of the old controller node from the known hosts file. You will specify the ip-addr value:

    ardana > ssh-keygen -f ~/.ssh/known_hosts -R <ip_addr>

    You should see a response similar to this one:

    ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f ~/.ssh/known_hosts -R 10.13.111.135
    # Host 10.13.111.135 found: line 6 type ECDSA
    ~/.ssh/known_hosts updated.
    Original contents retained as ~/.ssh/known_hosts.old
  8. Run the cobbler-deploy playbook to add the new controller node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
    Important

    You must ensure that the old controller node is powered off before completing this step. This is because the new controller node will re-use the original IP address.

  10. Run the wipe_disks.yml playbook to ensure all non-OS partitions on the new node are completely wiped prior to continuing with the installation. (The value to be used for hostname is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.)

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  11. Run osconfig on the replacement controller node. For example:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
  12. If the controller being replaced is the swift ring builder (see Section 18.6.2.4, “Identifying the Swift Ring Building Server”) you need to restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 18.6.2.7, “Recovering swift Builder Files” for details.

  13. Run the ardana-deploy playbook on the replacement controller.

    If the node being replaced is the swift ring builder server, you only need to pass that node to the --limit switch. Otherwise, specify both the hostname of the node being replaced and the hostname of your swift ring builder server.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True \
      --limit=<controller-hostname>,<swift-ring-builder-hostname>
    Important

    If you receive a keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys.

    In this situation, do not use the --limit option when running keystone-reconfigure.yml. In order to re-sync Fernet keys, all the controller nodes must be in the play.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
    Important

    If you receive a RabbitMQ failure when running this playbook, review Section 18.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue and then re-run the ardana-deploy playbook.

  14. Alarms will be raised while the node is being replaced. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
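The --limit rule from step 13 can be sketched as a small shell helper. This is only an illustration, not part of the product: the function name deploy_limit and the host names in the comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build the --limit value for ardana-deploy.yml (step 13).
# $1 = hostname of the replacement controller
# $2 = hostname of the swift ring builder server
deploy_limit() {
    local controller="$1"
    local ring_builder="$2"
    if [ "$controller" = "$ring_builder" ]; then
        # The replaced node is itself the ring builder: limit to that node only.
        printf '%s\n' "$controller"
    else
        # Otherwise both hosts must be in the play.
        printf '%s,%s\n' "$controller" "$ring_builder"
    fi
}

# Example (placeholder host names):
#   cd ~/scratch/ansible/next/ardana/ansible
#   ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True \
#       --limit="$(deploy_limit ardana-cp1-c1-m2-mgmt ardana-cp1-c1-m1-mgmt)"
```

Computing the value in one place makes it harder to forget the ring builder host when the replaced controller is a different node.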

15.1.3 Planned Compute Maintenance

Planned maintenance tasks for compute nodes.

15.1.3.1 Planned Maintenance for a Compute Node

If one or more of your compute nodes needs hardware maintenance and you can schedule it in advance, follow this procedure.

15.1.3.1.1 Performing planned maintenance on a compute node

If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrator credentials:

    source ~/service.osrc
  3. Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

    openstack host list | grep compute

    The following example shows two compute nodes:

    $ openstack host list | grep compute
    | ardana-cp1-comp0001-mgmt | compute     | AZ1      |
    | ardana-cp1-comp0002-mgmt | compute     | AZ2      |
  4. Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:

    openstack compute service set --disable --disable-reason "Maintenance mode" <hostname> nova-compute
    Note

    Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:

    openstack compute service set --enable <hostname> nova-compute
  5. At this point you have two choices:

    1. Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.

    2. Stop/start the instances: Issuing openstack server stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.

    If you choose the live migration route, see Section 15.1.3.3, “Live Migration of Instances” for more details. Skip to step 6 after you finish live migration.

    If you choose the stop/start method, continue with the substeps below.

    1. List all of the instances on the node so you can issue stop commands to them:

      openstack server list --host <hostname> --all-tenants
    2. Issue the openstack server stop command against each of the instances:

      openstack server stop <instance uuid>
    3. Confirm that the instances are stopped. If stoppage was successful you should see the instances in a SHUTOFF state, as shown here:

      $ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
      | ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
      | ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
    4. Do your required maintenance. If this maintenance does not take down the disks completely then you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:

      openstack server list --host <hostname> --all-tenants
    5. Start the instances back up using this command:

      openstack server start <instance uuid>

      Example:

      $ openstack server start ef31c453-f046-4355-9bd3-11e774b1772f
      Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.
    6. Confirm that the instances started back up. If restarting is successful you should see the instances in an ACTIVE state, as shown here:

      $ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
      | ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
      | ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
    7. If openstack server start fails, try a hard reboot:

      openstack server reboot --hard <instance uuid>

      If this does not resolve the issue you may want to contact support.

  6. Re-enable provisioning when the node is fixed:

    openstack compute service set --enable <hostname> nova-compute
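For hosts with many instances, the stop/start substeps above can be scripted. The following is a minimal sketch, assuming the administrator credentials are already sourced and that the openstack client accepts the -f value and -c output options. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for the stop/start route.
# Assumes the admin credentials are sourced (source ~/service.osrc).

# Print the UUID of every instance on compute host $1.
instances_on_host() {
    openstack server list --host "$1" --all-tenants -f value -c ID
}

# Issue a stop request for every instance on compute host $1.
stop_all_on_host() {
    instances_on_host "$1" | while read -r uuid; do
        # Redirect stdin so the client cannot consume the UUID list.
        openstack server stop "$uuid" < /dev/null
    done
}

# Succeed only if every instance on compute host $1 reports SHUTOFF.
all_shutoff() {
    local statuses
    statuses=$(openstack server list --host "$1" --all-tenants -f value -c Status) || return 1
    # A host without instances counts as fully stopped.
    [ -z "$statuses" ] && return 0
    ! printf '%s\n' "$statuses" | grep -qv '^SHUTOFF$'
}
```

A typical use is to run stop_all_on_host <hostname>, then poll all_shutoff <hostname> until it succeeds before starting the maintenance. The stop requests are asynchronous, so the check may fail for a short time after they are issued.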

15.1.3.2 Rebooting a Compute Node

If you only need to reboot a Compute node, use the following steps.

You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. The playbook used below ensures that all services on the Compute node are restarted properly.

  1. Log in to the Cloud Lifecycle Manager.

  2. Reboot the Compute node(s) with the following playbook.

    You can specify either single or multiple Compute nodes using the --limit switch.

    An optional reboot wait time can also be specified. If no reboot wait time is specified it will default to 300 seconds.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
    Note

    If the Compute node fails to reboot, you should troubleshoot this issue separately as this playbook will not attempt to recover after a failed reboot.
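As an illustration of the two optional parts of the reboot command, the helper below prints the full command line for a given node list and wait time. It is a sketch only: the function name reboot_compute_cmd and the host names are hypothetical, and the node list is assumed to be comma-separated, as Ansible accepts for --limit.

```shell
#!/bin/bash
# Hypothetical helper: print the reboot command for one or more Compute nodes.
# $1 = comma-separated list of Compute node names from hosts/verb_hosts
# $2 = optional wait time in seconds; when omitted, the playbook default applies
reboot_compute_cmd() {
    local nodes="$1"
    local timeout="${2:-}"
    local cmd="ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit $nodes"
    if [ -n "$timeout" ]; then
        cmd="$cmd -e nova_reboot_wait_timeout=$timeout"
    fi
    printf '%s\n' "$cmd"
}

# Example (placeholder host names):
#   reboot_compute_cmd ardana-cp1-comp0001-mgmt,ardana-cp1-comp0002-mgmt 600
```

Run the printed command from ~/scratch/ansible/next/ardana/ansible on the Cloud Lifecycle Manager.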

15.1.3.3 Live Migration of Instances

Live migration allows you to move active compute instances between compute nodes, allowing for less downtime during maintenance.

SUSE OpenStack Cloud nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use depends on the state of the host, the operating system on the host, the type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off the host. This section describes these options and gives step-by-step instructions for performing them.

15.1.3.3.1 Migration Options

If your compute node has failed

A compute host failure can be caused by a hardware failure, such as a data disk that needs to be replaced or a loss of power, or by any other type of failure that requires you to replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.

In these cases you will want to use one of the nova evacuate commands, which will cause nova to rebuild the instances on other hosts.

This table describes each of the evacuate options for failed compute nodes:

Command | Description

nova evacuate <instance> <hostname>

This command is used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified then the nova scheduler will choose one for you.

See nova help evacuate for more information and syntax. Further details can also be seen in the OpenStack documentation at http://docs.openstack.org/admin-guide/cli_nova_evacuate.html.

nova host-evacuate <hostname> [--target_host <target_hostname>]

This command is used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified then the nova scheduler will choose a target host for each instance.

See nova help host-evacuate for more information and syntax.

If your compute host is active, powered on, and the data disks are in working order, you can use the migration commands to migrate your compute instances. There are two migration features: "cold" migration (also referred to simply as "migration") and live migration. These are two different functions.

Cold migration copies the data of an instance in a SHUTOFF state from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the openstack server migrate function is disabled by default, but you can enable it if you wish. Details on how to do this can be found in Section 6.4, “Enabling the Nova Resize and Migrate Features”.

Live migration can be performed on instances in either an ACTIVE or PAUSED state. It uses the QEMU hypervisor to copy the running processes and associated resources to the destination compute host using the hypervisor's own protocol, which makes it a more secure method and allows for less downtime. During a live migration there may be a short network outage, usually a few milliseconds but possibly up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.

The compute host must remain powered on during the migration process.

Both the cold migration and live migration options honor nova group policies, which include affinity settings. A limitation to keep in mind when you use group policies is discussed in Section 15.1.3.3.2, “Limitations of these Features”.

This table describes each of the migration options for active compute nodes:

Command | Description | SLES

openstack server migrate <instance_uuid>

Used to cold migrate a single instance from a compute host. The nova-scheduler will choose the new host.

This command works against instances in an ACTIVE or SHUTOFF state. Active instances are shut down and restarted. Instances in a PAUSED state cannot be cold migrated.

See the difference between cold migration and live migration at the start of this section.

 

nova host-servers-migrate <hostname>

Used to cold migrate all instances off a specified host to other available hosts, chosen by the nova-scheduler.

This command works against instances in an ACTIVE or SHUTOFF state. Active instances are shut down and restarted. Instances in a PAUSED state cannot be cold migrated.

See the difference between cold migration and live migration at the start of this section.

 

nova live-migration <instance_uuid> [<target host>]

Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached.

This command works against instances in ACTIVE or PAUSED states only.

X

nova live-migration --block-migrate <instance_uuid> [<target host>]

Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that use local (ephemeral) disk(s) only, or that have a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume.

This command works against instances in ACTIVE or PAUSED states only.

X

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached.

This command works against instances in ACTIVE or PAUSED states only.

X

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that use local (ephemeral) disk(s) only, or that have a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume.

This command works against instances in ACTIVE or PAUSED states only.

X
15.1.3.3.2 Limitations of these Features

There are limitations that may impact your use of this feature:

  • To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state then cold migration should be used.

  • Instances in a PAUSED state cannot be live migrated using the horizon dashboard. Use the python-novaclient CLI to live migrate them.

  • Both cold migration and live migration honor an instance's group policies. If you are utilizing an affinity policy and are migrating multiple instances you may run into an error stating no hosts are available to migrate to. To work around this issue you should specify a target host when migrating these instances, which will bypass the nova-scheduler. You should ensure that the target host you choose has the resources available to host the instances.

  • The nova host-evacuate-live command will produce an error if you have a compute host that has a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, once with and once without the --block-migrate option. This is described in further detail in Section 15.1.3.3, “Live Migration of Instances”.

  • Instances on KVM hosts can only be live migrated to other KVM hosts.

  • The migration options described in this document are not available on ESX compute hosts.

  • Read and take into account any other limitations listed in the release notes.

15.1.3.3.3 Performing a Live Migration

Cloud administrators can perform a migration on an instance using the horizon dashboard, the API, or the CLI. Instances in a PAUSED state cannot be live migrated using the horizon GUI; use the CLI for these instances.

The following scenarios are documented:

15.1.3.3.4 Migrating instances off of a failed compute host
  1. Log in to the Cloud Lifecycle Manager.

  2. If the compute node is not already powered off, do so with this playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
    Note

    The value for <node_name> is the name shown for the node when you run sudo cobbler system list from the Cloud Lifecycle Manager.

  3. Source the admin credentials necessary to run administrative commands against the nova API:

    source ~/service.osrc
  4. Force the nova-compute service to go down on the compute node:

    openstack compute service set --down HOSTNAME nova-compute
    Note

    The value for HOSTNAME can be obtained by using openstack host list from the Cloud Lifecycle Manager.

  5. Evacuate the instances off of the failed compute node. This will cause the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.

    For single instances on a failed host:

    nova evacuate <instance_uuid> [<target_hostname>]

    For all instances on a failed host:

    nova host-evacuate <hostname> [--target_host <target_hostname>]
  6. After you have repaired the failed node and started it back up, the nova-compute process cleans up the evacuated instances when it starts.
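Steps 4 and 5 above can be combined into a short script. The sketch below is an assumption-laden illustration rather than a supported tool: the names run, evacuate_failed_host, and DRY_RUN are hypothetical, and the power-down step is left out because it uses the Cobbler node name rather than the nova host name.

```shell
#!/bin/bash
# Hypothetical wrapper for the failed-host procedure.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 = compute host name as shown by "openstack host list"
# $2 = optional target host; when omitted, the nova scheduler chooses
evacuate_failed_host() {
    local host="$1"
    local target="${2:-}"
    # Force nova-compute down on the failed host (step 4).
    run openstack compute service set --down "$host" nova-compute || return 1
    # Rebuild all instances from the failed host on other hosts (step 5).
    if [ -n "$target" ]; then
        run nova host-evacuate "$host" --target_host "$target"
    else
        run nova host-evacuate "$host"
    fi
}
```

Running it first with DRY_RUN=1 lets you review the exact commands before any instance is rebuilt. Local ephemeral data on the evacuated instances is lost.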

15.1.3.3.5 Migrating instances off of an active compute host

Migrating instances using the horizon dashboard

The horizon dashboard offers a GUI method for performing live migrations. horizon does not offer the live migration option for instances in a PAUSED state, so use the CLI instructions in the next section for those instances.

  1. Log in to the horizon dashboard with admin credentials.

  2. Navigate to the menu Admin › Compute › Instances.

  3. Next to the instance you want to migrate, select the drop down menu and choose the Live Migrate Instance option.

  4. In the Live Migrate wizard you will see the compute host the instance currently resides on and then a drop down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:

    Disk Over Commit - If this is not checked then the value will be False. If you check this box then it will allow you to override the check that occurs to ensure the destination host has the available disk space to host the instance.

    Block Migration - If this is not checked then the value will be False. If you check this box then it will migrate the local disks by using block migration. Use this option if you are only using ephemeral storage on your instances. If you are using block storage for your instance then ensure this box is not checked.

  5. To begin the live migration, click Submit.

Migrating instances using the python-novaclient CLI

To perform migrations from the command line, use the python-novaclient. The Cloud Lifecycle Manager node in your cloud environment should already have the python-novaclient installed. If you access your environment through a different method, ensure that the python-novaclient is installed. You can install it using Python's pip package manager.

To run the commands in the steps below, you need administrator credentials. From the Cloud Lifecycle Manager, you can source the provided service.osrc file, which contains the necessary credentials:

source ~/service.osrc

Here are the steps to perform:

  1. Log in to the Cloud Lifecycle Manager.

  2. Identify the instances on the compute node you wish to migrate:

    openstack server list --all-tenants --host <hostname>

    Example showing a host with a single compute instance on it:

    ardana > openstack server list --host ardana-cp1-comp0001-mgmt --all-tenants
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
    | ID                                   | Name | Tenant ID                        | Status | Task State | Power State | Networks              |
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
    | 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | -          | Running     | adminnetwork=10.0.0.5 |
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
  3. When using live migration you can either specify a target host that the instance will be migrated to or you can omit the target to allow the nova-scheduler to choose a node for you. If you want to get a list of available hosts you can use this command:

    openstack host list
  4. Migrate the instance(s) on the compute node using the notes below.

    If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:

    nova live-migration <instance uuid> [<target compute host>]

    If your instance has local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s), you should use the --block-migrate option:

    nova live-migration --block-migrate <instance uuid> [<target compute host>]
    Note

    The <target compute host> argument is optional. If you do not specify a target host, the nova scheduler chooses a node for you.

    Multiple instances

    If you want to live migrate all of the instances off a single compute host you can utilize the nova host-evacuate-live command.

    Issue the host-evacuate-live command, which will begin the live migration process.

    If all of the instances on the host are using at least one local (ephemeral) disk, you should use this syntax:

    nova host-evacuate-live --block-migrate <hostname>

    Alternatively, if all of the instances are only using block storage volumes then omit the --block-migrate option:

    nova host-evacuate-live <hostname>
    Note

    You can either let the nova-scheduler choose a suitable target host or you can specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.
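The choice between the two single-instance commands in step 4 can be sketched as a helper that prints the matching command. The function name live_migrate_cmd and the storage labels "volume" and "ephemeral" are hypothetical; you still have to determine the storage type of each instance yourself.

```shell
#!/bin/bash
# Hypothetical helper: print the live-migration command for one instance.
# $1 = instance UUID
# $2 = "volume" if the instance is booted from a block storage volume,
#      "ephemeral" if it is booted from local (ephemeral) disk
# $3 = optional target compute host
live_migrate_cmd() {
    local uuid="$1"
    local storage="$2"
    local target="${3:-}"
    local cmd="nova live-migration"
    case "$storage" in
        volume)    ;;                                 # no extra option needed
        ephemeral) cmd="$cmd --block-migrate" ;;      # local disks must be copied
        *) echo "unknown storage type: $storage" >&2; return 1 ;;
    esac
    cmd="$cmd $uuid"
    # Append the target host only when one was given.
    [ -n "$target" ] && cmd="$cmd $target"
    printf '%s\n' "$cmd"
}
```

Printing the command instead of executing it keeps the sketch safe to try. Passing a value other than "volume" or "ephemeral" fails with an error rather than guessing.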

15.1.3.3.6 Troubleshooting migration or host evacuate issues

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                                                                                        |
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 95a7ded8-ebfc-4848-9090-2df378c88a4c | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7)     |
| 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration cannot be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6)     |
+--------------------------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| e9874122-c5dc-406f-9039-217d9258c020 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a)     |
| 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112)     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)

Fix: This occurs when you are attempting to live migrate an instance that was booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration --block-migrate <instance_uuid> <target_hostname>

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt
ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)

Fix: This occurs when you are attempting to live migrate an instance that was booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration <instance_uuid> <target_hostname>
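The four cases above come down to two error strings. As a rough sketch, the helper below maps an error message to the corresponding fix. The function name suggest_block_migrate_fix is hypothetical, and the matching relies only on the message fragments shown in the examples above.

```shell
#!/bin/bash
# Hypothetical helper: map a live-migration error message ($1) to the fix
# described in this section.
suggest_block_migrate_fix() {
    case "$1" in
        *"is not on shared storage"*)
            # Instance uses local storage: block migration is required.
            echo "retry with --block-migrate" ;;
        *"is not on local storage"*)
            # Instance is booted from a volume: block migration is not allowed.
            echo "retry without --block-migrate" ;;
        *)
            echo "no suggestion" ;;
    esac
}

# Example:
#   suggest_block_migrate_fix "ERROR (BadRequest): host is not on local storage: ..."
```

Any message that matches neither fragment yields "no suggestion", so other failures still need to be investigated manually.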

15.1.3.4 Adding Compute Node

Adding a Compute Node allows you to add capacity.

15.1.3.4.1 Adding a SLES Compute Node

Adding a SLES compute node gives you capacity for more virtual machines.

Follow these steps whenever you need to add SLES compute hosts, whether for more virtual machine capacity or for another purpose.

There are two methods you can use to add SLES compute hosts to your environment:

  1. Adding SLES pre-installed compute hosts. This method does not require the SLES ISO be on the Cloud Lifecycle Manager to complete.

  2. Using the provided Ansible playbooks and Cobbler, SLES will be installed on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP4 ISO during the initial installation of your cloud, following the instructions at Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute”, Section 31.1 “SLES Compute Node Installation Overview”.

    If you want to use the provided Ansible playbooks and Cobbler to set up and configure your SLES hosts, and you did not have the SUSE Linux Enterprise Server 12 SP4 ISO on your Cloud Lifecycle Manager during your initial installation, read the note at the top of that section before proceeding.

15.1.3.4.1.1 Prerequisites

You need to ensure your input model files are properly set up for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute”, Section 31.3 “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.1 “SLES Compute Nodes”.

15.1.3.4.1.2 Adding a SLES compute node

Adding pre-installed SLES compute hosts

This method requires that you have SUSE Linux Enterprise Server 12 SP4 pre-installed on the baremetal host prior to beginning these steps.

  1. Ensure you have SUSE Linux Enterprise Server 12 SP4 pre-installed on your baremetal host.

  2. Log in to the Cloud Lifecycle Manager.

  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the format below. The IPMI details are left out because they are not needed when the SLES OS is pre-installed on your host(s).

    - id: compute4
      ip-addr: 192.168.102.70
      role: SLES-COMPUTE-ROLE
      server-group: RACK1

    You can find detailed descriptions of these fields in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

    Important

    You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  5. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

  8. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

    Note

    The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    The value to use for <hostname> is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  9. Complete the compute host deployment with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
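
Step 3 above asks you to verify that the ip-addr value you chose does not conflict with an address already recorded in address_info.yml. A minimal sketch of that check follows. It assumes only that addresses appear as plain text somewhere in the file. The function name ip_in_use is hypothetical and is not part of the Cloud Lifecycle Manager tooling.

```shell
# Return success (exit status 0) if the exact IPv4 address appears in the file.
ip_in_use() {
    ip=$1
    file=$2
    # Escape the dots so that they match literally in the regular expression.
    pattern=$(printf '%s' "$ip" | sed 's/\./\\./g')
    # Require a non-address character (or a line edge) on both sides, so that
    # 192.168.102.7 is not reported as a match inside 192.168.102.70.
    grep -Eq "(^|[^0-9.])${pattern}([^0-9.]|\$)" "$file"
}

# Example call, using the file named in step 3:
# if ip_in_use 192.168.102.70 ~/openstack/my_cloud/info/address_info.yml; then
#     echo "192.168.102.70 is already allocated"
# fi
```

A plain substring search would report false conflicts for addresses that share a prefix, which is why the sketch anchors the match on both sides.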

Adding SLES compute hosts with Ansible playbooks and Cobbler

These steps show you how to add the new SLES compute host to your servers.yml file and then run the playbooks that update your cloud configuration. You will run these playbooks from the Cloud Lifecycle Manager.

If you did not have the SUSE Linux Enterprise Server 12 SP4 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute”.

When you are prepared to continue, use these steps:

  1. Log in to your Cloud Lifecycle Manager.

  2. Check out the site branch of your local git repository so you can begin to make the necessary edits:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > git checkout site
  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one you would add your details to the bottom of the file in this format:

    - id: compute4
      ip-addr: 192.168.102.70
      role: SLES-COMPUTE-ROLE
      server-group: RACK1
      mac-addr: e8:39:35:21:32:4e
      ilo-ip: 10.1.192.36
      ilo-password: password
      ilo-user: admin
      distro-id: sles12sp4-x86_64

    You can find detailed descriptions of these fields in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

    Important

    You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  5. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Run the following playbook to confirm that your servers are accessible over their IPMI ports:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4
  8. Add the new node into Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
  10. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
    Note

    If you do not know the <node name>, you can get it by using sudo cobbler system list.

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

  11. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  12. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your hosts are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
    Note

    You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

  13. Verify that the netmask, bootproto, and other necessary settings are correct, and correct them if they are not. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute” for details.

  14. Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you are adding with the --limit switch:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
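
Step 4 of both procedures asks you to keep member-count, min-count, and max-count in line with the total node count. One way to get the current number of servers for a role is to count the matching role lines in servers.yml, as in the sketch below. It assumes that every server entry carries a single role key, as in the example entries above. The function name count_role_servers is hypothetical.

```shell
# Print the number of server entries in the file that use the given role.
count_role_servers() {
    role=$1
    file=$2
    # Match "role: ROLE" either on its own line or as the first key of a
    # list item ("- role: ROLE").
    awk -v role="$role" '
        ($1 == "role:" && $2 == role) ||
        ($1 == "-" && $2 == "role:" && $3 == role) { n++ }
        END { print n + 0 }
    ' "$file"
}

# Example call:
# count_role_servers SLES-COMPUTE-ROLE ~/openstack/my_cloud/definition/data/servers.yml
```

Compare the printed number with the values in control_plane.yml before you commit your changes.
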
15.1.3.4.1.3 Adding a new SLES compute node to monitoring

If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

15.1.3.5 Removing a Compute Node

Removing a Compute node allows you to remove capacity.

The steps below show you how to remove a Compute node.

15.1.3.5.1 Disable Provisioning on the Compute Host
  1. Get a list of the running nova services. This provides the details you need to disable provisioning on the Compute host you want to remove:

    ardana > openstack compute service list

    Here is an example. The Compute node we are going to remove in these examples is ardana-cp1-comp0002-mgmt:

    ardana > openstack compute service list
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
    | Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
  2. Disable the nova service on the Compute node you want to remove, which ensures it is taken out of the scheduling rotation:

    ardana > openstack compute service set --disable --disable-reason "enter reason here" NODE_HOSTNAME nova-compute

    Here is an example that disables ardana-cp1-comp0002-mgmt from the output above:

    ardana > openstack compute service set --disable --disable-reason "hardware reallocation" ardana-cp1-comp0002-mgmt nova-compute
    +--------------------------+--------------+----------+-----------------------+
    | Host                     | Binary       | Status   | Disabled Reason       |
    +--------------------------+--------------+----------+-----------------------+
    | ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
    +--------------------------+--------------+----------+-----------------------+
15.1.3.5.2 Remove the Compute Host from its Availability Zone

If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.

  1. Get a list of the running nova services. This provides the details you need to remove a Compute node:

    ardana > openstack compute service list

    Here is an example. The Compute node we are going to remove in these examples is ardana-cp1-comp0002-mgmt:

    ardana > openstack compute service list
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
    | Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason       |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -                     |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -                     |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -                     |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | hardware reallocation |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
  2. If the Zone reported for this host is simply "nova", then it is not a member of a particular availability zone, and this step will not be necessary. Otherwise, you must remove the Compute host from its availability zone:

    ardana > openstack aggregate remove host AVAILABILITY_ZONE NOVA_HOSTNAME

    In the example from the previous step, the ardana-cp1-comp0002-mgmt host was in the AZ2 availability zone, so you would use this command to remove it:

    ardana > openstack aggregate remove host AZ2 ardana-cp1-comp0002-mgmt
    Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
    +----+------+-------------------+-------+-------------------------+
    | Id | Name | Availability Zone | Hosts | Metadata                |
    +----+------+-------------------+-------+-------------------------+
    | 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
    +----+------+-------------------+-------+-------------------------+
  3. You can confirm the last two steps completed successfully by running another openstack compute service list.

    Here is an example which confirms that the node has been disabled (Status column) and that it has been removed from the availability zone (Zone column):

    ardana > openstack compute service list
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
15.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts
  1. Verify whether the Compute node is currently hosting any instances. You can do this with the command below:

    ardana > openstack server list --host NOVA_HOSTNAME --all-projects

    Here is an example which shows that a single instance is currently running on this node:

    ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
    | ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
    | 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
  2. You will likely want to migrate this instance off of this node before removing it. You can do this with the live migration functionality within nova. Use --block-migrate only for instances booted from local storage, and omit it for instances booted from a block storage volume. The command will look like this:

    ardana > nova live-migration --block-migrate INSTANCE_ID

    Here is an example using the instance in the previous step:

    ardana > nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9

    You can check the status of the migration using the same command from the previous step:

    ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
    | ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
    | 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
  3. List the compute instances again to see that the running instance has been migrated:

    ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
    +----+------+-----------+--------+------------+-------------+----------+
    | ID | Name | Tenant ID | Status | Task State | Power State | Networks |
    +----+------+-----------+--------+------------+-------------+----------+
    +----+------+-----------+--------+------------+-------------+----------+
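
The check in step 3 can also be done by counting the data rows in the table that openstack server list prints. The sketch below reads that table from standard input and prints the number of instances. It assumes the bordered table layout shown in the examples above. The function name count_instances is hypothetical.

```shell
# Count the data rows of a bordered table read from standard input.
# Border lines contain no "|" separator and the header row has "ID" in its
# first column, so both are skipped.
count_instances() {
    awk -F'|' 'NF > 2 && $2 !~ /^ *ID *$/ { n++ } END { print n + 0 }'
}

# Example call, to be run where the openstack client is configured:
# openstack server list --host ardana-cp1-comp0002-mgmt --all-projects | count_instances
```

A result of 0 corresponds to the empty table in step 3, which means the host no longer runs any instances.
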
15.1.3.5.4 Disable Neutron Agents on Node to be Removed

You should also locate and disable the neutron agents on the node; they are deleted in a later step. List the neutron agents running on the node, then disable each one by its ID:

ardana > openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

ardana > openstack network agent set --disable 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana > openstack network agent set --disable dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana > openstack network agent set --disable f0d262d1-7139-40c7-bdc2-f227c6dee5c8
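
If the node runs several agents, you may prefer to extract the agent IDs from the table rather than copy them by hand. The sketch below reads the agent table from standard input and prints the ID of every agent on the given host. It assumes the column order shown above, with the ID in the first column and the host in the third. The function name agent_ids_for_host is hypothetical.

```shell
# Print the ID of every agent whose host column equals the given host name.
agent_ids_for_host() {
    host=$1
    awk -F'|' -v host="$host" '
        NF > 4 {
            id = $2
            h = $4
            gsub(/ /, "", id)   # strip the padding around the table cells
            gsub(/ /, "", h)
            if (h == host) print id
        }'
}

# Example call, to be run where the openstack client is configured:
# openstack network agent list | agent_ids_for_host ardana-cp1-comp0002-mgmt |
#     while read -r id; do openstack network agent set --disable "$id"; done
```

Matching on the exact host column avoids the partial matches that a plain grep on the node name can produce.
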
15.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host

To perform this step you have a few options. You can SSH into the Compute host and run the following commands:

tux > sudo systemctl stop nova-compute
tux > sudo systemctl stop 'neutron-*'

Because the neutron agent self-registers against the neutron server, you may want to prevent the following services from coming back online. Here is how you can get the list:

tux > sudo systemctl list-units 'neutron-*' --all

Here are the results:

UNIT                                  LOAD        ACTIVE     SUB      DESCRIPTION
neutron-common-rundir.service         loaded      inactive   dead     Create /var/run/neutron
•neutron-dhcp-agent.service         not-found     inactive   dead     neutron-dhcp-agent.service
neutron-l3-agent.service              loaded      inactive   dead     neutron-l3-agent Service
neutron-metadata-agent.service        loaded      inactive   dead     neutron-metadata-agent Service
•neutron-openvswitch-agent.service    loaded      failed     failed   neutron-openvswitch-agent Service
neutron-ovs-cleanup.service           loaded      inactive   dead     neutron OVS Cleanup Service

        LOAD   = Reflects whether the unit definition was properly loaded.
        ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
        SUB    = The low-level unit activation state, values depend on unit type.

        7 loaded units listed.
        To show all installed unit files use 'systemctl list-unit-files'.

For each loaded service, issue the command:

tux > sudo systemctl disable SERVICE_NAME

In the above example, that would be each service except neutron-dhcp-agent.service.

For example:

tux > sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-metadata-agent neutron-openvswitch-agent
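
The list of loaded units can also be taken directly from the systemctl output. The sketch below reads that output from standard input and prints every neutron unit whose LOAD column is loaded. It assumes the column layout shown above. The function name loaded_neutron_units is hypothetical. It prints every loaded unit, including neutron-ovs-cleanup.service, so review the list before you disable anything.

```shell
# Print the neutron service units whose LOAD state is "loaded".
loaded_neutron_units() {
    awk '
        {
            # Drop anything before the first lowercase letter, such as the
            # status marker that systemctl prints in front of some units.
            sub(/^[^a-z]*/, "")
        }
        $1 ~ /^neutron-.*\.service$/ && $2 == "loaded" { print $1 }
    '
}

# Example call on the Compute host:
# sudo systemctl list-units 'neutron-*' --all | loaded_neutron_units |
#     while read -r unit; do sudo systemctl disable "$unit"; done
```

Units reported as not-found, such as neutron-dhcp-agent.service in the example output, are skipped because there is nothing to disable.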

Now you can shut down the node:

tux > sudo shutdown now

OR

From the Cloud Lifecycle Manager you can use the bm-power-down.yml playbook to shut down the node:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=NODE_NAME
Note

The NODE_NAME value is the name corresponding to this node in Cobbler. You can run sudo cobbler system list to retrieve these names.

15.1.3.5.6 Delete the Compute Host from Nova

Retrieve the list of nova services:

ardana > openstack compute service list

Here is an example. The Compute host we are going to remove is ardana-cp1-comp0002-mgmt, with service ID 37:

ardana > openstack compute service list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+

Delete the host from nova using the command below:

ardana > openstack compute service delete SERVICE_ID

Following our example above, you would use:

ardana > openstack compute service delete 37

Use the command below to confirm that the Compute host has been completely removed from nova:

ardana > openstack hypervisor list
15.1.3.5.7 Delete the Compute Host from Neutron

Multiple neutron agents are running on the compute node. You have to remove all of the agents running on the node using the openstack network agent delete command. In the example below, the l3-agent, openvswitch-agent and metadata-agent are running:

ardana > openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

ardana > openstack network agent delete AGENT_ID

ardana > openstack network agent delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana > openstack network agent delete dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana > openstack network agent delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
15.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor

Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit your servers.yml file in the location below to remove references to the Compute node(s) you want to remove:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > vi servers.yml
  3. You may also need to edit your control_plane.yml file to update the values for member-count, min-count, and max-count, if you used them, so that they reflect the exact number of nodes you are using.

    See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  4. Commit the changes to git:

    ardana > git commit -a -m "Remove node NODE_NAME"
  5. To release the network capacity allocated to the deleted server(s), use the switches remove_deleted_servers and free_unused_addresses when running the configuration processor. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.)

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml \
      -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Refresh the /etc/hosts file throughout the cloud to remove references to the old node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
15.1.3.5.9 Remove the Compute Host from Cobbler

Complete these steps to remove the node from Cobbler:

  1. Confirm the system name in Cobbler with this command:

    tux > sudo cobbler system list
  2. Remove the system from Cobbler using this command:

    tux > sudo cobbler system remove --name=NODE_NAME
  3. Run the cobbler-deploy.yml playbook to complete the process:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
15.1.3.5.10 Remove the Compute Host from Monitoring

After you remove a Compute node, the alarms defined against it will trigger, so additional steps are needed to resolve them.

To find all monasca API servers:

tux > sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
    bind ardana-cp1-vip-public-MON-API-extapi:8070  ssl crt /etc/ssl/private//my-public-cert-entry-scale
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
listen ardana-cp1-vip-MON-API-mgmt-8070
    bind ardana-cp1-vip-MON-API-mgmt:8070  ssl crt /etc/ssl/private//ardana-internal-cert
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5

In the example above, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the monasca API servers.
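The server names can also be extracted mechanically. The following is a minimal sketch, assuming the server lines have the layout shown in the example above (a name containing MON_API followed by host:port). The function name list_mon_api_servers is hypothetical and not part of the product.

```shell
# Hypothetical helper: print the unique monasca API hosts found in HAProxy
# configuration text read from standard input.
# Assumes "server <name>-MON_API-<port> <host>:<port> ..." lines as in the example.
list_mon_api_servers() {
  awk '$1 == "server" && $2 ~ /MON_API/ { split($3, a, ":"); print a[1] }' | sort -u
}

# Possible usage on a controller node:
#   sudo cat /etc/haproxy/haproxy.cfg | list_mon_api_servers
```

Because the same servers appear under both the public and the internal listener, the output is de-duplicated with sort -u.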

SSH to each of the monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Compute node you removed. This requires sudo access. The entries look similar to the following:

- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping
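
If you prefer to preview the result before editing the file by hand, the filtering can be sketched as follows. This is only an illustration, assuming every entry starts with an unindented "- " line and carries its host_name on an indented line, as in the entry above. The function name remove_host_alive_entry is hypothetical. It writes the filtered text to standard output and does not modify the file.

```shell
# Hypothetical helper: read host_alive.yaml text on standard input and print it
# without the list entries whose host_name equals the given host.
# Assumes entries begin with an unindented "- " line, as in the example entry.
remove_host_alive_entry() {
  awk -v host="$1" '
    # Print the buffered block unless it was marked for removal, then reset.
    function emit() { if (!drop) printf "%s", buf; buf = ""; drop = 0 }
    /^[^ ]/ { emit() }                 # any unindented line starts a new block
    { buf = buf $0 "\n" }
    $1 == "host_name:" && $2 == host { drop = 1 }
    END { emit() }
  '
}

# Possible usage (review the output before replacing the file):
#   sudo cat /etc/monasca/agent/conf.d/host_alive.yaml \
#     | remove_host_alive_entry ardana-cp1-comp0001-mgmt
```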

Once you have removed the references on each of your monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the Compute node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the monasca CLI which should be installed on each of your monasca API servers by default:

ardana > monasca alarm-list --metric-dimensions hostname=<hostname_of_deleted_compute_node>

For example, if your Compute node looked like the example above then you would use this command to get the alarm ID:

ardana > monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt

You can then delete the alarm with this command:

ardana > monasca alarm-delete <alarm_ID>

15.1.4 Planned Network Maintenance

Planned maintenance tasks for networking nodes.

15.1.4.1 Adding a Network Node

Adding an additional neutron networking node allows you to increase the performance of your cloud.

The following steps describe how to add a neutron network node, whether for increased performance or for another purpose.

15.1.4.1.1 Prerequisites

If you are using the mid-scale model, your networking nodes are already separate and the roles are defined. If you are not already using this model and wish to add separate networking nodes, you need to ensure that those roles are defined. You can look in the ~/openstack/examples folder on your Cloud Lifecycle Manager for the mid-scale example model files, which show how to do this. The basic edits that need to be made are summarized below:

  1. In your server_roles.yml file, ensure you have the NEUTRON-ROLE defined.

    Path to file:

    ~/openstack/my_cloud/definition/data/server_roles.yml

    Example snippet:

    - name: NEUTRON-ROLE
      interface-model: NEUTRON-INTERFACES
      disk-model: NEUTRON-DISKS
  2. In your net_interfaces.yml file, ensure you have the NEUTRON-INTERFACES defined.

    Path to file:

    ~/openstack/my_cloud/definition/data/net_interfaces.yml

    Example snippet:

    - name: NEUTRON-INTERFACES
      network-interfaces:
      - device:
          name: hed3
        name: hed3
        network-groups:
        - EXTERNAL-VM
        - GUEST
        - MANAGEMENT
  3. Create a disks_neutron.yml file and ensure that NEUTRON-DISKS is defined in it.

    Path to file:

    ~/openstack/my_cloud/definition/data/disks_neutron.yml

    Example snippet:

      product:
        version: 2
    
      disk-models:
      - name: NEUTRON-DISKS
        volume-groups:
          - name: ardana-vg
            physical-volumes:
             - /dev/sda_root
    
            logical-volumes:
            # The policy is not to consume 100% of the space of each volume group.
            # 5% should be left free for snapshots and to allow for some flexibility.
              - name: root
                size: 35%
                fstype: ext4
                mount: /
              - name: log
                size: 50%
                mount: /var/log
                fstype: ext4
                mkfs-opts: -O large_file
              - name: crash
                size: 10%
                mount: /var/crash
                fstype: ext4
                mkfs-opts: -O large_file
  4. Modify your control_plane.yml file so that it references the NEUTRON-ROLE and includes the neutron services.

    Path to file:

    ~/openstack/my_cloud/definition/data/control_plane.yml

    Example snippet:

      - allocation-policy: strict
        cluster-prefix: neut
        member-count: 1
        name: neut
        server-role: NEUTRON-ROLE
        service-components:
        - ntp-client
        - neutron-vpn-agent
        - neutron-dhcp-agent
        - neutron-metadata-agent
        - neutron-openvswitch-agent

You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 2 “Hardware and Software Support Matrix”.
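
Before moving on, you can confirm that the names used in the four edits above are present in your input model. The following is a minimal sketch, assuming the file names and role names from the example snippets above. The function name check_neutron_model is hypothetical, and the check only looks for the names as text; it does not validate the YAML.

```shell
# Hypothetical helper: report which of the names used in the steps above are
# missing from the input model files in the given data directory.
check_neutron_model() {
  local dir="$1" missing=0 pair file name
  for pair in \
    "server_roles.yml:NEUTRON-ROLE" \
    "net_interfaces.yml:NEUTRON-INTERFACES" \
    "disks_neutron.yml:NEUTRON-DISKS" \
    "control_plane.yml:NEUTRON-ROLE"; do
    file="${pair%%:*}"   # text before the first colon
    name="${pair#*:}"    # text after the first colon
    if ! grep -qwF -- "$name" "$dir/$file" 2>/dev/null; then
      echo "missing: $name in $file"
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage:
#   check_neutron_model ~/openstack/my_cloud/definition/data
```

The function returns a non-zero status if any name is missing, so it can be used as a guard before you commit the changes.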

15.1.4.1.2 Adding a network node

These steps show you how to add the new network node to your servers.yml file and then run the playbooks that update your cloud configuration. You run these playbooks from the Cloud Lifecycle Manager.

  1. Log in to your Cloud Lifecycle Manager.

  2. Check out the site branch of your local Git repository so you can begin to make the necessary edits:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > git checkout site
  3. In the same directory, edit your servers.yml file to include the details about your new network node(s).

    For example, if you already had a cluster of three network nodes and needed to add a fourth one you would add your details to the bottom of the file in this format:

    # network nodes
    - id: neut3
      role: NEUTRON-ROLE
      server-group: RACK2
      mac-addr: "5c:b9:01:89:b6:18"
      nic-mapping: HP-DL360-6PORT
      ip-addr: 10.243.140.22
      ilo-ip: 10.1.12.91
      ilo-password: password
      ilo-user: admin
    Important

    You will need to verify that the ip-addr value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your control_plane.yml file you will need to check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match up with your new total node count. So for example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.

  5. Commit the changes to git:

    ardana > git commit -a -m "Add new networking node <name>"
  6. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Add the new node into Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>
    Note

    If you do not know the <hostname>, you can get it by using sudo cobbler system list.

  10. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  11. Configure the operating system on the new networking node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
  12. Complete the networking node deployment with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>
  13. Run the site.yml playbook with the required tag so that all other services become aware of the new node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
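
The member-count check in step 4 can be supported with a quick count of the servers that carry the network role. The following is a minimal sketch, assuming each server entry in servers.yml has its role on a separate "role:" line, as in the example in step 3. The function name count_role is hypothetical.

```shell
# Hypothetical helper: count the servers in servers.yml that carry a given role.
count_role() {
  # $1 = path to servers.yml, $2 = role name (for example NEUTRON-ROLE)
  # grep -c exits non-zero when the count is 0, so the status is neutralized.
  grep -cE "^[[:space:]]*role:[[:space:]]*$2[[:space:]]*\$" "$1" || true
}

# Possible usage:
#   count_role ~/openstack/my_cloud/definition/data/servers.yml NEUTRON-ROLE
```

Compare the printed number with the member-count value of the corresponding cluster in control_plane.yml.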
15.1.4.1.3 Adding a New Network Node to Monitoring

If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

15.1.5 Planned Storage Maintenance

Planned maintenance procedures for swift storage nodes.

15.1.5.1 Planned Maintenance Tasks for swift Nodes

Planned maintenance tasks including recovering, adding, and removing swift nodes.

15.1.5.1.1 Adding a Swift Object Node

Adding additional object nodes allows you to increase capacity.

This topic describes how to add additional swift object server nodes to an existing system.

15.1.5.1.1.1 To add a new node

To add a new node to your cloud, add it to servers.yml and then run the playbooks that update your cloud configuration. You access the servers.yml file by checking out the Git branch in which the changes are made, as described in the following steps.

Perform the following steps to add a new node:

  1. Log in to the Cloud Lifecycle Manager node.

  2. Get the servers.yml file stored in Git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  4. Add the details of new nodes to the servers.yml file. In the following example only one new server swobj4 is added. However, you can add multiple servers by providing the server details in the servers.yml file:

    servers:
    ...
    - id: swobj4
      role: SWOBJ-ROLE
      server-group: <server-group-name>
      mac-addr: <mac-address>
      nic-mapping: <nic-mapping-name>
      ip-addr: <ip-address>
      ilo-ip: <ilo-ip-address>
      ilo-user: <ilo-username>
      ilo-password: <ilo-password>
  5. Commit your changes:

    git add -A
    git commit -m "Add Node <name>"
    Note

    Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:

    export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY

    For instructions, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 30 “Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only”.

  6. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list with the nodelist argument):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>

    In the following example, the server id is swobj4 (mentioned in step 4):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
    Note

    You must use the server id as it appears in the file servers.yml in the field id.

  9. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    The hostname of the newly added server can be found in the list generated from the output of the following command:

    grep hostname ~/openstack/my_cloud/info/server_info.yml

    For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt
  10. Validate that the disk drives of the new node are compatible with the disk model used by the node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit 'SWF*'

    If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  11. Run the following playbook to ensure that the hosts file on all other servers is updated with the new server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
  12. Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
  13. You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

    For example:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
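
Because the nodelist argument in step 8 must match the id field in servers.yml exactly, it can help to confirm the id before reimaging. The following is a minimal sketch, assuming each entry begins with a "- id:" line as in the example in step 4. The function name server_id_exists is hypothetical.

```shell
# Hypothetical helper: succeed only if servers.yml contains "- id: <server-id>".
server_id_exists() {
  # $1 = path to servers.yml, $2 = server id (for example swobj4)
  grep -qE "^[[:space:]]*-[[:space:]]+id:[[:space:]]*$2[[:space:]]*\$" "$1"
}

# Possible usage:
#   server_id_exists ~/openstack/my_cloud/definition/data/servers.yml swobj4 \
#     && echo "id found, safe to pass to nodelist"
```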
15.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node

Steps for adding additional PAC nodes to your swift system.

This topic describes how to add additional swift proxy, account, and container (PAC) servers to an existing system.

15.1.5.1.2.1 Adding a new node

To add a new node to your cloud, add it to servers.yml and then run the playbooks that update your cloud configuration. You access the servers.yml file by checking out the Git branch in which the changes are made, as described in the following steps.

Perform the following steps to add a new node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Get the servers.yml file stored in Git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  4. Add details of new nodes to the servers.yml file:

    servers:
    ...
    - id: swpac6
      role: SWPAC-ROLE
      server-group: <server-group-name>
      mac-addr: <mac-address>
      nic-mapping: <nic-mapping-name>
      ip-addr: <ip-address>
      ilo-ip: <ilo-ip-address>
      ilo-user: <ilo-username>
      ilo-password: <ilo-password>

    In the above example, only one new server swpac6 is added. However, you can add multiple servers by providing the server details in the servers.yml file.

    In the entry-scale configurations there is no dedicated swift PAC cluster. Instead, there is a cluster using servers that have a role of CONTROLLER-ROLE. You cannot add additional nodes dedicated exclusively to swift PAC because that would change the member-count of the entire cluster. In that case, to create a dedicated swift PAC cluster, you will need to add it to the configuration files. For details on how to do this, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.7 “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.

    If you are using new PAC nodes, you must add the configuration details of the PAC nodes to the following YAML files:

    control_plane.yml
    disks_pac.yml
    net_interfaces.yml
    servers.yml
    server_roles.yml

    You can see a good example of this in the example configurations for the mid-scale model in the ~/openstack/examples/mid-scale-kvm directory.

    The following steps assume that you have already created a dedicated swift PAC cluster and that it has two members (swpac4 and swpac5).

  5. Set the member count of the swift PAC cluster to match the number of nodes. For example, if you are adding swpac6 as the third swift PAC node, the member count should be increased from 2 to 3 as shown in the following example:

    control-planes:
        - name: control-plane-1
          control-plane-prefix: cp1
    
      . . .
      clusters:
      . . .
         - name: swpac
           cluster-prefix: swpac
           server-role: SWPAC-ROLE
           member-count: 3
       . . .
  6. Commit your changes:

    git add -A
    git commit -m "Add Node <name>"
    Note

    Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:

    export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY

    For instructions, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 30 “Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only”.

  7. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  8. Create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  9. Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the nodelist argument):

    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>

    In the following example, the server id is swpac6 (mentioned in step 4):

    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
    Note

    You must use the server id as it appears in the file servers.yml in the field id.

  10. Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, ardana) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
  11. Validate that the disk drives of the new node are compatible with the disk model used by the node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml

    If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  12. Run the following playbook to ensure that the hosts file on all other servers is updated with the new server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
  13. Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swpac6) that you are adding:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
  14. You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

15.1.5.1.3 Adding Additional Disks to a Swift Node

Steps for adding additional disks to any nodes hosting swift services.

This section describes how to add additional disks to a node for swift usage. These steps work for adding disks to swift object nodes or to proxy, account, container (PAC) nodes. They also apply to adding disks to a controller node that hosts the swift service, as is the case in the entry-scale example models.

Read through the notes below before beginning the process.

You can add multiple disks at the same time; there is no need to add them one at a time.

Important: Add the Same Number of Disks

You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three swift servers.

15.1.5.1.3.1 Adding additional disks to your Swift servers
  1. Verify the general health of the swift system and that it is safe to rebalance your rings. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.

  2. Perform the disk maintenance.

    1. Shut down the first swift server you wish to add disks to.

    2. Add the additional disks to the physical server. The disk drives that are added should be clean. They should either contain no partitions or a single partition the size of the entire disk. They should not contain a file system or any volume groups. Failure to comply will cause errors and the disk will not be added.

      For more details, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.6 “Swift Requirements for Device Group Drives”.

    3. Power the server on.

    4. While the server was shut down, data that normally would have been placed on the server was placed elsewhere. When the server is rebooted, the swift replication process moves that data back onto the server. Monitor the replication process to determine when it is complete. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.

    5. Repeat the steps from Step 2.a for each of the swift servers you are adding the disks to, one at a time.

      Note

      If the additional disks can be added to the swift servers online (for example, via hotplugging) then there is no need to perform the last two steps.

  3. On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.

    1. Edit the disk configuration file that correlates to the type of server you are adding your new disks to.

      Path to the typical disk configuration files:

      ~/openstack/my_cloud/definition/data/disks_swobj.yml
      ~/openstack/my_cloud/definition/data/disks_swpac.yml
      ~/openstack/my_cloud/definition/data/disks_controller_*.yml

      Example showing the addition of a single new disk, /dev/sdd:

      device-groups:
        - name: swiftObject
          devices:
            - name: "/dev/sdb"
            - name: "/dev/sdc"
            - name: "/dev/sdd"
          consumer:
            name: swift
            ...
      Note

      For more details on how the disk model works, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”.

    2. Configure the swift weight-step value in the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.

    3. Commit the changes to Git:

      cd ~/openstack
      git commit -a -m "adding additional swift disks"
    4. Run the configuration processor:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost config-processor-run.yml
    5. Update your deployment directory:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run the osconfig-run.yml playbook against the swift nodes you have added disks to. Use the --limit switch to target the specific nodes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>

    You can use a wildcard when specifying the hostnames with the --limit switch. If you added disks to all of the swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...), you can use a wildcard like ardana-cp1-swobj*. If you added disks to only some of the nodes, you can use a comma-delimited list and enter the hostname of each node you added disks to.

  5. Validate your swift configuration with this playbook which will also provide details of each drive being added:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
  6. Verify that swift services are running on all of your servers:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-status.yml
  7. If everything looks okay with the swift status, then apply the changes to your swift rings with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  8. At this point your swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
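
The comma-delimited form of the --limit value mentioned in step 4 can be built from a list of hostnames. The following is a minimal sketch; the function name limit_list is hypothetical and the hostnames in the usage comment are illustrative only.

```shell
# Hypothetical helper: join hostnames into a comma-delimited list for --limit.
limit_list() {
  local IFS=,          # "$*" joins the arguments with the first character of IFS
  printf '%s\n' "$*"
}

# Possible usage:
#   ansible-playbook -i hosts/verb_hosts osconfig-run.yml \
#     --limit "$(limit_list ardana-cp1-swobj0001-mgmt ardana-cp1-swobj0002-mgmt)"
```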

15.1.5.1.4 Removing a Swift Node

Removal process for both swift Object and PAC nodes.

You can use this process when you want to remove one or more swift nodes permanently. This process applies to both swift Proxy, Account, Container (PAC) nodes and swift Object nodes.

15.1.5.1.4.1 Setting the Pass-through Attributes

This process will remove the swift node's drives from the rings and rebalance their responsibilities among the remaining nodes in your cluster. Note that removal will not succeed if it causes the number of remaining disks in the cluster to decrease below the replica count of its rings.

  1. Log in to the Cloud Lifecycle Manager.

  2. Ensure that the weight-step attribute is set. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.

  3. Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include in your ~/openstack/my_cloud/definition/data/servers.yml file since your server IDs are already listed in that file. For more information about pass-through, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.17 “Pass Through”.

    Here is the format required, which can be inserted at the topmost level of indentation in your file (typically 2 spaces):

    pass-through:
      servers:
        - id: server-id
          data:
            subsystem:
              subsystem-attributes

    Here is an example:

    ---
      product:
        version: 2
    
      pass-through:
        servers:
          - id: ccn-0001
            data:
              swift:
                drain: yes

    If a pass-through definition already exists in any of your input model data files, just include the additional data for the server which you are removing instead of defining an entirely new pass-through block.

    By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute as described in the previous step. This process is known as "draining", where you remove the swift data from the node in preparation for removing the node.

  4. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Use the playbook to create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  8. Wait until the replication has completed. For further details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  9. Determine whether all of the partitions have been removed from all drives on the swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:

    cd /etc/swiftlm/cloud1/cp1/builder_dir/
    sudo swift-ring-builder ring_name.builder

    For example, if the node you are removing was part of the object-0 ring, the command would be:

    sudo swift-ring-builder object-0.builder

    Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:

    $ cd /etc/swiftlm/cloud1/cp1/builder_dir/
    $ sudo swift-ring-builder object-0.builder
    account.builder, build version 6
    4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
    The minimum number of hours before a partition can be reassigned is 16
    The overload factor is 0.00% (0.000000)
    Devices:    id  region  zone      ip address  port  replication ip  replication port      name weight partitions balance meta
                 0       1     1   192.168.245.3  6002   192.168.245.3              6002     disk0   0.00          0   -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
                 1       1     1   192.168.245.3  6002   192.168.245.3              6002     disk1   0.00          0   -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
                 2       1     1   192.168.245.4  6002   192.168.245.4              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc
                 3       1     1   192.168.245.4  6002   192.168.245.4              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m2:disk1:/dev/sdd
                 4       1     1   192.168.245.5  6002   192.168.245.5              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m3:disk0:/dev/sdc
                 5       1     1   192.168.245.5  6002   192.168.245.5              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m3:disk1:/dev/sdd
  10. If the number of partitions is zero for the server on all rings, you can move to the next step, otherwise continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.

  11. When the number of partitions is zero for the server on all rings, you can remove the drives of the swift node from all rings. Edit the pass-through data you created in step 3 and set the remove attribute as shown in this example:

    ---
      product:
        version: 2
    
      pass-through:
        servers:
          - id: ccn-0001
            data:
              swift:
                remove: yes
  12. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  13. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  14. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  15. Run the swift deploy playbook to rebuild the rings by removing the server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  16. At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.
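
The partition check described in steps 9 and 10 can be scripted. The following is a minimal sketch, not part of the product: partitions_on_server is a hypothetical helper name, and it assumes the column layout shown in the example output above (IP address in the fourth column, partition count in the tenth). Verify the columns against your own swift-ring-builder output before relying on it.

```shell
# Hypothetical helper: sum the partitions assigned to one server, reading
# "swift-ring-builder <builder file>" output on stdin.
# Assumption: IP address is column 4 and partition count is column 10,
# as in the example output in this section.
partitions_on_server() {
    awk -v ip="$1" '
        # device rows start with a numeric id; header and summary rows do not match
        $1 ~ /^[0-9]+$/ && $4 == ip { sum += $10 }
        END { print sum + 0 }
    '
}

# Example use, run from the builder directory:
#   for builder in *.builder; do
#       echo "$builder: $(sudo swift-ring-builder "$builder" | partitions_on_server 192.168.245.3)"
#   done
```

A result of 0 for every builder file corresponds to the condition in step 10.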

15.1.5.1.4.2 To Disable Swift on a Node

The next phase in this process will disable the swift service on the node. In this example, swobj4 is the node being removed from swift.

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop swift services on the node using the swift-stop.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit hostname
    Note

    When using the --limit argument, you must specify the full hostname (for example: ardana-cp1-swobj0004) or use the wild card * (for example, *swobj4*).

    The following example uses the swift-stop.yml playbook to stop swift services on ardana-cp1-swobj0004:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
  3. Remove the configuration files.

    ssh ardana-cp1-swobj0004-mgmt sudo rm -R /etc/swift
    Note

    Do not run any other playbooks until you have finished the process described in Section 15.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate /etc/swift and restart swift on swobj4. If you accidentally run a playbook, repeat the process in Section 15.1.5.1.4.2, “To Disable Swift on a Node”.

15.1.5.1.4.3 To Remove a Node from the Input Model

Use the following steps to finish the process of removing the swift node.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/definition/data/servers.yml file and remove the entry for the node (swobj4 in this example). In addition, remove the related entry you created in the pass-through section earlier in this process.

  3. If this was a SWPAC node, reduce the member-count attribute by 1 in the ~/openstack/my_cloud/definition/data/control_plane.yml file. For SWOBJ nodes, no such action is needed.

  4. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml

    Using the remove_deleted_servers and free_unused_addresses switches is recommended to free up the resources associated with the removed node when running the configuration processor. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.

    ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Validate the changes you have made to the configuration files using the playbook below before proceeding further:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit 'SWF*'

    If any errors occur, correct them in your configuration files and repeat steps 3-5 until no errors remain before going to the next step.

    For more details on how to interpret and resolve errors, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  8. Remove the node from Cobbler:

    sudo cobbler system remove --name=swobj4
  9. Run the Cobbler deploy playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  10. The final step will depend on what type of swift node you are removing.

    If the node was a SWPAC node, run the ardana-deploy.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml

    If the node was a SWOBJ node (and not a SWPAC node), run the swift-deploy.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  11. Wait until replication has finished. For more details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  12. You may need to continue to rebalance the rings. For instructions, see the Final Rebalance Phase in Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

15.1.5.1.4.4 Remove the Swift Node from Monitoring

Once you have removed the swift node(s), the alarms against them will trigger so there are additional steps to take to resolve this issue.

Connect to each of the nodes in your cluster running the monasca-api service (as defined in ~/openstack/my_cloud/definition/data/control_plane.yml) and use sudo vi /etc/monasca/agent/conf.d/host_alive.yaml to delete all references to the swift node(s) you removed.

Once you have removed the references on each of your monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the swift node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the monasca CLI which should be installed on each of your monasca API servers by default:

monasca alarm-list --metric-dimensions hostname=SWIFT_NODE_DELETED

You can then delete the alarm with this command:

monasca alarm-delete ALARM_ID
15.1.5.1.5 Replacing a swift Node

Maintenance steps for replacing a failed swift node in your environment.

This process is used when you want to replace a failed swift node in your cloud.

Warning

If it applies to the server, do not skip the step that restores the swift ring builder files (step 7.3 in the procedure below). If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but it may move most objects in your system to new locations and make data unavailable until the replication process has completed.

15.1.5.1.5.1 How to replace a swift node in your environment
  1. Log in to the Cloud Lifecycle Manager.

  2. Power off the node.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=OLD_SWIFT_CONTROLLER_NODE
  3. Update your cloud configuration with the details of your replacement swift node.

    1. Edit your servers.yml file to include the details about your replacement swift node (MAC address, IPMI user, IPMI password, and IPMI IP address, if these have changed).

      Note

      Do not change the server's IP address (that is, ip-addr).

      Path to file:

      ~/openstack/my_cloud/definition/data/servers.yml

      Example showing the fields to edit (mac-addr, ilo-ip, ilo-user, and ilo-password):

       - id: swobj5
         role: SWOBJ-ROLE
         server-group: rack2
         mac-addr: 8c:dc:d4:b5:cb:bd
         nic-mapping: HP-DL360-6PORT
         ip-addr: 10.243.131.10
         ilo-ip: 10.1.12.88
         ilo-user: iLOuser
         ilo-password: iLOpass
         ...
    2. Commit the changes to Git:

      ardana > cd ~/openstack
      ardana > git commit -a -m "replacing a swift node"
    3. Run the configuration processor:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    4. Update your deployment directory:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Prepare SLES:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook prepare-sles-loader.yml
    ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=NEW_REPLACEMENT_NODE
  5. Update Cobbler and reimage your replacement swift node:

    1. Obtain the Cobbler name of the node you are replacing. You will use this value to replace <node name> in later steps.

      ardana > sudo cobbler system list
    2. Remove the replaced swift node from Cobbler:

      ardana > sudo cobbler system remove --name <node name>
    3. Re-run the cobbler-deploy.yml playbook to add the replaced node:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
    4. Reimage the node using this playbook:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
  6. Wipe the disks on the NEW REPLACEMENT NODE. This action will not affect the OS partitions on the server.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit NEW_REPLACEMENT_NODE
  7. Complete the deployment of your replacement swift node.

    1. Obtain the hostname for your new swift node. Use this value to replace <hostname> in future steps.

      ardana > cat ~/openstack/my_cloud/info/server_info.yml
    2. Configure the operating system on your replacement swift node:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
    3. If this is the swift ring builder server, restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.

    4. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.

      ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
15.1.5.1.6 Replacing Drives in a swift Node

Maintenance steps for replacing drives in a swift node.

This process is used when you want to remove a failed hard drive from a swift node and replace it with a new one.

A swift node has two classes of drives that may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. Each class has its own replacement procedure to bring the node back to normal.

15.1.5.1.6.1 To Replace the Operating System Disk Drive

After the operating system disk drive is replaced, the node must be reimaged.

  1. Log in to the Cloud Lifecycle Manager.

  2. Update your Cobbler profile:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  3. Reimage the node using this playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>

    In the example below, the swobj2 server is reimaged:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2
  4. Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
  5. If this is the first server running the swift-proxy service, restore the swift Ring Builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.

  6. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.

    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass \
      --limit <hostname>

    For example:

    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
15.1.5.1.6.2 To Replace a Storage Disk Drive

After a storage drive is replaced, there is no need to reimage the server. Instead, run the swift-reconfigure.yml playbook.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the following commands:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>

    In the following example, the server used is swobj2:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt

15.1.6 Updating MariaDB with Galera

Updating MariaDB with Galera must be done manually. Updates are not installed automatically. This is particularly an issue with upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.

Using the CLI, update MariaDB with the following procedure:

  1. Mark Galera as unmanaged:

    crm resource unmanage galera

    Or put the whole cluster into maintenance mode:

    crm configure property maintenance-mode=true
  2. Pick a node other than the one currently targeted by the load balancer and stop MariaDB on that node:

    crm_resource --wait --force-demote -r galera -V
  3. Perform updates:

    1. Uninstall the old versions of MariaDB and the Galera wsrep provider.

    2. Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.

    3. Change configuration options if necessary.

  4. Start MariaDB on the node.

    crm_resource --wait --force-promote -r galera -V
  5. Run mysql_upgrade with the --skip-write-binlog option.

  6. On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run mysql_upgrade.

  7. Mark Galera as managed:

    crm resource manage galera

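    Or, if you put the whole cluster into maintenance mode in step 1, take the cluster out of maintenance mode by setting the maintenance-mode property back to false.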

15.2 Unplanned System Maintenance

Unplanned maintenance tasks for your cloud.

15.2.1 Whole Cloud Recovery Procedures

Unplanned maintenance procedures for your whole cloud.

15.2.1.1 Full Disaster Recovery

In this disaster scenario, you have lost everything in your cloud. In other words, you have lost access to all data stored in the cloud that was not backed up to an external backup location, including:

  • Data in swift object storage

  • glance images

  • cinder volumes

  • Metering, Monitoring, and Logging (MML) data

  • Workloads running on compute resources

In effect, the following recovery process creates a minimal new cloud with the existing identity information. Much of the operating state and data would have been lost, as would running workloads.

Important

We recommend backups external to your cloud for your data, including as much as possible of the types of resources listed above. Most workloads that were running could possibly be recreated with sufficient external backups.

15.2.1.1.1 Install and Set Up a Cloud Lifecycle Manager Node

Before beginning the process of a full cloud recovery, you need to install and set up a Cloud Lifecycle Manager node as though you are creating a new cloud. There are several steps in that process:

  1. Install the appropriate version of SUSE Linux Enterprise Server

  2. Restore passwd, shadow, and group files. They have User ID (UID) and group ID (GID) content that will be used to set up the new cloud. If these are not restored immediately after installing the operating system, the cloud deployment will create new UIDs and GIDs, overwriting the existing content.

  3. Install Cloud Lifecycle Manager software

  4. Prepare the Cloud Lifecycle Manager, which includes installing the necessary packages

  5. Initialize the Cloud Lifecycle Manager

  6. Restore your OpenStack git repository

  7. Adjust input model settings if the hardware setup has changed

The following sections cover these steps in detail.

15.2.1.1.2 Install the Operating System

Follow the instructions for installing SUSE Linux Enterprise Server in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”.

15.2.1.1.3 Restore files with UID and GID content
Important

There is a risk that you may lose data completely. Restore the backups for /etc/passwd, /etc/shadow, and /etc/group immediately after installing SUSE Linux Enterprise Server.

Some backup files contain content that would no longer be valid if your cloud were to be freshly deployed in the next step of a whole cloud recovery. As a result, some of the backups must be restored before deploying a new cloud. Three kinds of backups are involved: passwd, shadow, and group. The following steps will restore those backups.

  1. Log in to the server where the Cloud Lifecycle Manager will be installed.

  2. Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.

    ardana > scp USER@REMOTE_SERVER:TAR_ARCHIVE .
  3. Untar the TAR archives to overwrite the three locations:

    • passwd

    • shadow

    • group

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gz

    The following are examples. Use the actual tar.gz file names of the backups.

    BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ -f passwd.tar.gz

    BACKUP_TARGET=/etc/shadow

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ -f shadow.tar.gz

    BACKUP_TARGET=/etc/group

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ -f group.tar.gz
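
The three restores above differ only in the archive name, so they can be expressed as a loop. The following is a minimal sketch, assuming the archives are named passwd.tar.gz, shadow.tar.gz, and group.tar.gz as in the examples. restore_etc_archives is a hypothetical helper name. By default it only prints the commands so that you can review them first.

```shell
# Hypothetical helper: restore the passwd, shadow, and group archives into /etc/.
# $1: directory that holds the archives (assumed file names: NAME.tar.gz)
# $2: command prefix; defaults to "echo" (print only), pass "sudo" to execute
restore_etc_archives() {
    backup_dir=$1
    runner=${2:-echo}
    for name in passwd shadow group; do
        "$runner" tar -z --incremental --extract --ignore-zeros \
            --warning=none --overwrite --directory /etc/ \
            -f "$backup_dir/$name.tar.gz" || return 1
    done
}

# Example use:
#   restore_etc_archives /tmp/backups         # print the three tar commands
#   restore_etc_archives /tmp/backups sudo    # run them
```

Printing first keeps the destructive overwrite of /etc/passwd, /etc/shadow, and /etc/group an explicit second step.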
15.2.1.1.4 Install the Cloud Lifecycle Manager

To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.5.2 “Installing the SUSE OpenStack Cloud Extension”.

15.2.1.1.5 Prepare to deploy your cloud

The following is the general process for preparing to deploy a SUSE OpenStack Cloud. You may not need to perform all the steps, depending on your particular disaster recovery situation.

Important

When you install the ardana cloud pattern in the following process, the ardana user and ardana group will already exist in /etc/passwd and /etc/group. Do not re-create them.

When you run ardana-init in the following process, /var/lib/ardana is created as a deployer account using the account settings in /etc/passwd and /etc/group that were restored in the previous step.
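
Because ardana-init relies on the restored account settings, it can be useful to confirm that the ardana entry is present, with the expected IDs, before continuing. The following is a minimal sketch. passwd_ids is a hypothetical helper name, and it relies only on the standard colon-separated passwd format (name, password, UID, GID, and so on).

```shell
# Hypothetical helper: print "UID:GID" for a user from a passwd-format file.
# $1: user name
# $2: passwd-format file (defaults to /etc/passwd)
passwd_ids() {
    awk -F: -v user="$1" '$1 == user { print $3 ":" $4 }' "${2:-/etc/passwd}"
}

# Example use:
#   passwd_ids ardana                       # IDs in the restored /etc/passwd
#   passwd_ids ardana /path/to/old/passwd   # IDs in another copy, for comparison
```

Empty output means that the user is not defined in the file that was checked.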

15.2.1.1.5.1 Prepare for Cloud Installation
  1. Review the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 14 “Pre-Installation Checklist” about recommended pre-installation tasks.

  2. Prepare the Cloud Lifecycle Manager node. The Cloud Lifecycle Manager must be accessible either directly or via ssh, and have SUSE Linux Enterprise Server 12 SP4 installed. All nodes must be accessible to the Cloud Lifecycle Manager. If the nodes do not have direct access to online Cloud subscription channels, the Cloud Lifecycle Manager node will need to host the Cloud repositories.

    1. If you followed the installation instructions for Cloud Lifecycle Manager server (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”), SUSE OpenStack Cloud software should already be installed. Double-check whether SUSE Linux Enterprise and SUSE OpenStack Cloud are properly registered at the SUSE Customer Center by starting YaST and running Software › Product Registration.

      If you have not yet installed SUSE OpenStack Cloud, do so by starting YaST and running Software › Product Registration › Select Extensions. Choose SUSE OpenStack Cloud and follow the on-screen instructions. Make sure to register SUSE OpenStack Cloud during the installation process and to install the software pattern patterns-cloud-ardana.

      tux > sudo zypper -n in patterns-cloud-ardana
    2. Ensure the SUSE OpenStack Cloud media repositories and updates repositories are made available to all nodes in your deployment. This can be accomplished either by configuring the Cloud Lifecycle Manager server as an SMT mirror as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 16 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)” or by syncing or mounting the Cloud and updates repositories to the Cloud Lifecycle Manager server as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 17 “Software Repository Setup”.

    3. Configure passwordless sudo for the user created when setting up the node (as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.4 “Creating a User”). Note that this is not the user ardana that will be used later in this procedure. In the following, we assume you named the user cloud. Run the command visudo as user root and add the following line to the end of the file:

      CLOUD ALL = (root) NOPASSWD:ALL

      Make sure to replace CLOUD with your user name choice.

    4. Set the password for the user ardana:

      tux > sudo passwd ardana
    5. Become the user ardana:

      tux > su - ardana
    6. Place a copy of the SUSE Linux Enterprise Server 12 SP4 .iso in the ardana home directory, /var/lib/ardana, and rename it to sles12sp4.iso.

    7. Install the templates, examples, and working model directories:

      ardana > /usr/bin/ardana-init
15.2.1.1.6 Restore the remaining Cloud Lifecycle Manager content from a remote backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.

    ardana > scp USER@REMOTE_SERVER:TAR_ARCHIVE .
  3. Untar the TAR archives to overwrite the remaining two required locations:

    • home

    • ssh

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gz

    The following are examples. Use the actual tar.gz file names of the backups.

    BACKUP_TARGET=/var/lib/ardana

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /var/lib/ -f home.tar.gz

    BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz
15.2.1.1.7 Re-deployment of controllers 1, 2 and 3
  1. Change back to the default ardana user.

  2. Run the cobbler-deploy.yml playbook.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  3. Run the bm-reimage.yml playbook limited to the second and third controllers.

    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3

    controller2 and controller3 are example names. Use the bm-power-status.yml playbook to check the Cobbler names of these nodes.

  4. Run the site.yml playbook limited to the three controllers and localhost. In this example, these are doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, doc-cp1-c1-m3-mgmt, and localhost:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit \
    doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  5. You can now perform the procedures to restore MariaDB and swift.

15.2.1.1.8 Restore MariaDB from a remote backup
  1. Log in to the first node running the MariaDB service.

  2. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

    ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
    --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
    -f mydb.tar.gz
  4. Verify that the files have been restored on the controller.

    ardana > sudo du -shx /tmp/mysql_restore/*
    16K     /tmp/mysql_restore/aria_log.00000001
    4.0K    /tmp/mysql_restore/aria_log_control
    3.4M    /tmp/mysql_restore/barbican
    8.0K    /tmp/mysql_restore/ceilometer
    4.2M    /tmp/mysql_restore/cinder
    2.9M    /tmp/mysql_restore/designate
    129M    /tmp/mysql_restore/galera.cache
    2.1M    /tmp/mysql_restore/glance
    4.0K    /tmp/mysql_restore/grastate.dat
    4.0K    /tmp/mysql_restore/gvwstate.dat
    2.6M    /tmp/mysql_restore/heat
    752K    /tmp/mysql_restore/horizon
    4.0K    /tmp/mysql_restore/ib_buffer_pool
    76M     /tmp/mysql_restore/ibdata1
    128M    /tmp/mysql_restore/ib_logfile0
    128M    /tmp/mysql_restore/ib_logfile1
    12M     /tmp/mysql_restore/ibtmp1
    16K     /tmp/mysql_restore/innobackup.backup.log
    313M    /tmp/mysql_restore/keystone
    716K    /tmp/mysql_restore/magnum
    12M     /tmp/mysql_restore/mon
    8.3M    /tmp/mysql_restore/monasca_transform
    0       /tmp/mysql_restore/multi-master.info
    11M     /tmp/mysql_restore/mysql
    4.0K    /tmp/mysql_restore/mysql_upgrade_info
    14M     /tmp/mysql_restore/nova
    4.4M    /tmp/mysql_restore/nova_api
    14M     /tmp/mysql_restore/nova_cell0
    3.6M    /tmp/mysql_restore/octavia
    208K    /tmp/mysql_restore/opsconsole
    38M     /tmp/mysql_restore/ovs_neutron
    8.0K    /tmp/mysql_restore/performance_schema
    24K     /tmp/mysql_restore/tc.log
    4.0K    /tmp/mysql_restore/test
    8.0K    /tmp/mysql_restore/winchester
    4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  5. Stop SUSE OpenStack Cloud services on the three controllers (using the hostnames of the controllers in your configuration).

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \
    doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  6. As the root user, delete the files in the /var/lib/mysql directory and copy the restored backup to that directory.

    root # cd /var/lib/mysql/
    root # rm -rf ./*
    root # cp -pr /tmp/mysql_restore/* ./
  7. Switch back to the ardana user when the copy is finished.

15.2.1.1.9 Restore swift from a remote backup
  1. Log in to the first swift Proxy (SWF-PRX--first-member) node.

    To find the first swift Proxy node:

    1. On the Cloud Lifecycle Manager, run:

      ardana > cd  ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
      --limit SWF-PRX--first-member

      At the end of the output, you will see something like the following example:

      ...
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'
      
      PLAY RECAP ********************************************************************
      ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
    2. Find the first node name and its IP address. For example:

      ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
  2. Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, swring.tar.gz).

    ardana > mkdir /tmp/swift_builder_dir_restore; sudo tar -z \
    --incremental --extract --ignore-zeros --warning=none --overwrite --directory \
    /tmp/swift_builder_dir_restore/  -f swring.tar.gz
  4. Log in to the Cloud Lifecycle Manager.

  5. Stop the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml
  6. Log back in to the first swift Proxy (SWF-PRX--first-member) node, which was determined in Step 1.

  7. Copy the restored files.

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  8. Log back in to the Cloud Lifecycle Manager.

  9. Reconfigure the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.1.1.10 Restart SUSE OpenStack Cloud services
  1. Restart the MariaDB database

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

    On the deployer node, execute the galera-bootstrap.yml playbook which will determine the log sequence number, bootstrap the main node, and start the database cluster.

    If this process fails to recover the database cluster, refer to Section 15.2.3.1.2, “Recovering the MariaDB Database”.

  2. Restart SUSE OpenStack Cloud services on the three controllers as in the following example.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
    --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  3. Reconfigure SUSE OpenStack Cloud

    ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
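
The galera-bootstrap.yml playbook in step 1 determines the log sequence number itself. If you need to inspect the value manually, for example while following the separate database recovery procedure, Galera records it in the grastate.dat file that appears in the restored backup listing. The following is a minimal sketch. galera_seqno is a hypothetical helper name, and the sketch assumes the usual grastate.dat layout with a line of the form "seqno: VALUE".

```shell
# Hypothetical helper: print the saved sequence number from a Galera state file.
# Reads the file named in $1, or stdin when no file is given.
# Assumption: the file contains a line of the form "seqno:   <number>".
galera_seqno() {
    awk '$1 == "seqno:" { print $2 }' "$@"
}

# Example use on a database node (the data directory is readable by root only):
#   sudo cat /var/lib/mysql/grastate.dat | galera_seqno
```

A value of -1 generally means that the node did not record a position at shutdown, so the number alone cannot be used to pick the most advanced node.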

15.2.2 Recover Start-up Processes

In this scenario, processes do not start. If those processes are not running, Ansible start-up scripts will fail. On the deployer, use Ansible to check status on the control plane servers. The following checks and remedies address common causes of this condition.

  • If disk space is low, determine the cause and remove anything that is no longer needed. Check disk space with the following command:

    ardana > ansible KEY-API -m shell -a 'df -h'
  • Check that Network Time Protocol (NTP) is synchronizing clocks properly with the following command.

    ardana > ansible resources -i hosts/verb_hosts \
    -m shell -a "sudo ntpq -c peers"
  • Check keepalived, the daemon that monitors services or systems and automatically fails over to a standby if problems occur.

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status keepalived | head -8"
  • Restart keepalived if necessary.

    1. Check RabbitMQ status first:

      ardana > ansible KEY-API -i hosts/verb_hosts \
      -m shell -a "sudo rabbitmqctl status | head -10"
    2. Restart RabbitMQ if necessary:

      ardana > ansible KEY-API -i hosts/verb_hosts \
      -m shell -a "sudo systemctl start rabbitmq-server"
    3. If RabbitMQ is running, restart keepalived:

      ardana > ansible KEY-API -i hosts/verb_hosts \
      -m shell -a "sudo systemctl restart keepalived"
  • If RabbitMQ is up, is it clustered?

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo rabbitmqctl cluster_status"

    Restart RabbitMQ cluster if necessary:

    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
  • Check Kafka messaging:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status kafka | head -5"
  • Check the Spark framework:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status spark-worker | head -8"
    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status spark-master | head -8"
  • If necessary, start Spark:

    ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
    ardana > ansible KEY-API -i hosts/verb_hosts -m shell -a \
    "sudo systemctl start spark-master | head -8"
  • Check Zookeeper centralized service:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status zookeeper | head -8"
  • Check MariaDB:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo mysql -e 'show status;' | grep -e wsrep_incoming_addresses \
    -e wsrep_local_state_comment "
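
The MariaDB check above prints the wsrep status variables for each node. The following is a minimal sketch of a helper that reads that output on standard input and reports whether the node is in the Synced state. It assumes the whitespace-separated name/value lines that mysql -e prints when its output is piped. The function name galera_is_synced is hypothetical and is not part of SUSE OpenStack Cloud.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with the product).
# Reads "show status" output (one "name value" pair per line) on stdin
# and succeeds only if wsrep_local_state_comment is "Synced".
galera_is_synced() {
    awk '$1 == "wsrep_local_state_comment" { state = $2 }
         END { exit (state == "Synced" ? 0 : 1) }'
}
```

On a controller node, this could be used as sudo mysql -e 'show status;' | galera_is_synced && echo "node is synced".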

15.2.3 Unplanned Control Plane Maintenance

Unplanned maintenance tasks for controller nodes such as recovery from power failure.

15.2.3.1 Restarting Controller Nodes After a Reboot

Steps to follow if one or more of your controller nodes lose network connectivity or power, including when a node is rebooted or needs hardware maintenance.

When a controller node is rebooted, needs hardware maintenance, loses network connectivity or loses power, these steps will help you recover the node.

These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.

15.2.3.1.1 Prerequisites

The following conditions must be true in order to perform these steps successfully:

  • Each of your controller nodes should be powered on.

  • Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.

  • The operator who performs these steps will need access to the Cloud Lifecycle Manager.

15.2.3.1.2 Recovering the MariaDB Database

The recovery process for your MariaDB database cluster will depend on how many of your controller nodes need to be recovered. We will cover two scenarios:

Scenario 1: Recovering one or two of your controller nodes but not the entire cluster

To recover one or two of your controller nodes but not the entire cluster, follow these steps:

  1. Ensure the controller nodes have power and are booted to the command prompt.

  2. If the MariaDB service is not started, start it with this command:

    ardana > sudo service mysql start
  3. If MariaDB fails to start, proceed to the next section which covers the bootstrap process.

Scenario 2: Recovering the entire controller cluster with the bootstrap playbook

If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.

  1. Make sure no mysqld daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is a mysqld daemon running, then use the command below to shut down the daemon.

    ardana > sudo systemctl stop mysql

    If the mysqld daemon does not go down following the service stop, then kill the daemon using kill -9 before continuing.

  2. On the deployer node, execute the galera-bootstrap.yml playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
15.2.3.1.3 Restarting Services on the Controller Nodes

From the Cloud Lifecycle Manager you should execute the ardana-start.yml playbook for each node that was brought down so the services can be started back up.

If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>

If you have a shared Cloud Lifecycle Manager/controller setup and need to restart services on this shared node, you can use localhost to indicate the shared node, like this:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost
Note

If you leave off the --limit switch, the playbook will be run against all nodes.
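
The two forms of the --limit argument shown above differ only in whether localhost is appended. As a rough sketch, a small helper can build the comma-separated value from a list of hostnames. The function name build_limit and its shared/dedicated mode argument are hypothetical and only illustrate the pattern.

```shell
#!/bin/bash
# Hypothetical helper: joins hostnames with commas for use with --limit.
# First argument: "shared" if the Cloud Lifecycle Manager shares a
# controller node (localhost is appended), any other value otherwise.
build_limit() {
    local mode=$1; shift
    local IFS=,
    local limit="$*"          # "$*" joins the arguments with the comma in IFS
    if [ "$mode" = "shared" ]; then
        limit="${limit:+$limit,}localhost"
    fi
    printf '%s\n' "$limit"
}
```

For example, ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit="$(build_limit shared doc-cp1-c1-m2-mgmt)" would limit the run to doc-cp1-c1-m2-mgmt,localhost.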

15.2.3.1.4 Restart the Monitoring Agents

As part of the recovery process, also restart the monasca-agent:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the monasca-agent:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-stop.yml
  3. Restart the monasca-agent:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-start.yml
  4. You can then confirm the status of the monasca-agent with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
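
The stop, start, and status playbooks above can also be chained so that the sequence halts at the first failure. This is a sketch only: the function name restart_monasca_agent and the PLAYBOOK_CMD override (useful for a dry run with echo) are assumptions, and the function must be run from ~/scratch/ansible/next/ardana/ansible.

```shell
#!/bin/bash
# Runs the three monasca-agent playbooks in order and stops at the first
# failure. PLAYBOOK_CMD is a hypothetical override, e.g. PLAYBOOK_CMD=echo
# prints the arguments instead of running ansible-playbook.
restart_monasca_agent() {
    local cmd=${PLAYBOOK_CMD:-ansible-playbook}
    local action
    for action in stop start status; do
        $cmd -i hosts/verb_hosts "monasca-agent-${action}.yml" || return 1
    done
}
```

Running PLAYBOOK_CMD=echo restart_monasca_agent first shows the three invocations without changing anything.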

15.2.3.2 Recovering the Control Plane

If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, there are several scenarios for recovering your cloud.

Note

If you backed up the Cloud Lifecycle Manager manually after installation (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 38 “Post Installation Tasks”), you will have a backup copy of /etc/group. When recovering a Cloud Lifecycle Manager node, manually copy the /etc/group file from a backup of the old Cloud Lifecycle Manager.

15.2.3.2.1 Point-in-Time MariaDB Database Recovery

In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.

15.2.3.2.1.1 Restore MariaDB manually

Follow this procedure to manually restore MariaDB:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the MariaDB cluster:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  3. On all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:

    ardana > sudo rm -r /var/lib/mysql/*
  4. On the first node running the MariaDB service restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.

    ardana > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql
  5. If you need to restore the files manually from SSH, follow these steps:

    1. Log in to the first node running the MariaDB service.

    2. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

    3. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

      ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
      --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
      -f mydb.tar.gz
    4. Verify that the files have been restored on the controller.

      ardana > sudo du -shx /tmp/mysql_restore/*
      16K     /tmp/mysql_restore/aria_log.00000001
      4.0K    /tmp/mysql_restore/aria_log_control
      3.4M    /tmp/mysql_restore/barbican
      8.0K    /tmp/mysql_restore/ceilometer
      4.2M    /tmp/mysql_restore/cinder
      2.9M    /tmp/mysql_restore/designate
      129M    /tmp/mysql_restore/galera.cache
      2.1M    /tmp/mysql_restore/glance
      4.0K    /tmp/mysql_restore/grastate.dat
      4.0K    /tmp/mysql_restore/gvwstate.dat
      2.6M    /tmp/mysql_restore/heat
      752K    /tmp/mysql_restore/horizon
      4.0K    /tmp/mysql_restore/ib_buffer_pool
      76M     /tmp/mysql_restore/ibdata1
      128M    /tmp/mysql_restore/ib_logfile0
      128M    /tmp/mysql_restore/ib_logfile1
      12M     /tmp/mysql_restore/ibtmp1
      16K     /tmp/mysql_restore/innobackup.backup.log
      313M    /tmp/mysql_restore/keystone
      716K    /tmp/mysql_restore/magnum
      12M     /tmp/mysql_restore/mon
      8.3M    /tmp/mysql_restore/monasca_transform
      0       /tmp/mysql_restore/multi-master.info
      11M     /tmp/mysql_restore/mysql
      4.0K    /tmp/mysql_restore/mysql_upgrade_info
      14M     /tmp/mysql_restore/nova
      4.4M    /tmp/mysql_restore/nova_api
      14M     /tmp/mysql_restore/nova_cell0
      3.6M    /tmp/mysql_restore/octavia
      208K    /tmp/mysql_restore/opsconsole
      38M     /tmp/mysql_restore/ovs_neutron
      8.0K    /tmp/mysql_restore/performance_schema
      24K     /tmp/mysql_restore/tc.log
      4.0K    /tmp/mysql_restore/test
      8.0K    /tmp/mysql_restore/winchester
      4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  6. Log back in to the Cloud Lifecycle Manager.

  7. Start the MariaDB service.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  8. After approximately 10-15 minutes, the output of the percona-status.yml playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

    An example output is as follows:

    TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] *************
      ok: [ardana-cp1-c1-m1-mgmt] => {
      "msg": "mysql is synced."
      }
      ok: [ardana-cp1-c1-m2-mgmt] => {
      "msg": "mysql is synced."
      }
      ok: [ardana-cp1-c1-m3-mgmt] => {
      "msg": "mysql is synced."
      }
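
Because the cluster can take several minutes to synchronize, it may help to check the playbook output mechanically. The following sketch counts the "mysql is synced." messages in the output format shown in the example above and compares the count with the number of MariaDB nodes you expect. The function name all_nodes_synced is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads percona-status.yml output on stdin and
# succeeds only if exactly $1 nodes reported "mysql is synced."
all_nodes_synced() {
    local expected=$1
    local synced
    # grep -c prints 0 and returns non-zero when nothing matches
    synced=$(grep -c '"msg": "mysql is synced\."' || true)
    [ "$synced" -eq "$expected" ]
}
```

For a three-node control plane, pipe the playbook output into all_nodes_synced 3 and repeat the status playbook later if the check fails.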
15.2.3.2.1.2 Point-in-Time Cassandra Recovery

A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.

The following steps should be taken before enabling and deploying the replacement node.

  1. Determine the IP address of the node that was removed or is being replaced.

  2. On one of the functional Cassandra control plane nodes, log in as the ardana user.

  3. Run the command nodetool status to display a list of Cassandra nodes.

  4. If the removed node is not in the list (no IP address matches that of the removed node), skip the next two steps.

  5. If the node that was removed is still in the list, copy its node ID.

  6. Run the command nodetool removenode ID.

After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 15.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.

For more information, please consult the Cassandra documentation.
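
The lookup of the node ID in the steps above can be sketched as a small filter over the nodetool status output. It assumes that each node line has the node address in the second column and contains the Host ID as a lowercase UUID. The function name cassandra_host_id and the addresses in the usage note are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads `nodetool status` output on stdin and prints
# the Host ID (a UUID) of the node whose address equals $1.
# Prints nothing if the address is not listed.
cassandra_host_id() {
    awk -v ip="$1" '$2 == ip' |
        grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
}
```

For example, nodetool status | cassandra_host_id 192.0.2.12 prints the ID to pass to nodetool removenode, or nothing if the node is already gone from the list.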

15.2.3.2.2 Point-in-Time swift Rings Recovery

In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your swift rings to a previous state.

Note

This process restores swift rings only, not swift data.

15.2.3.2.2.1 Restore from a swift backup
  1. Log in to the first swift Proxy (SWF-PRX--first-member) node.

    To find the first swift Proxy node:

    1. On the Cloud Lifecycle Manager, run:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
      --limit SWF-PRX--first-member

      At the end of the output, you will see something like the following example:

      ...
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'
      
      PLAY RECAP ********************************************************************
      ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
    2. Find the first node name and its IP address. For example:

      ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
  2. Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, swring.tar.gz).

    ardana > mkdir /tmp/swift_builder_dir_restore; sudo tar -z \
    --incremental --extract --ignore-zeros --warning=none --overwrite --directory \
    /tmp/swift_builder_dir_restore/  -f swring.tar.gz
  4. Log in to the Cloud Lifecycle Manager.

  5. Stop the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml
  6. Log back in to the first swift Proxy (SWF-PRX--first-member) node, which was determined in Step 1.

  7. Copy the restored files.

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  8. Log back in to the Cloud Lifecycle Manager.

  9. Reconfigure the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
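
The source and destination paths used when copying the restored files differ only in the cloud name and control plane name. As an illustration, the following sketch derives both paths from those two values. The function name swift_builder_paths and its optional third argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the restored source directory and the live
# destination directory for the swift builder files, one per line.
# $1 = cloud name, $2 = control plane name,
# $3 = restore root (defaults to the temporary directory used above).
swift_builder_paths() {
    local cloud=$1 control_plane=$2
    local restore_root=${3:-/tmp/swift_builder_dir_restore}
    printf '%s/%s/%s/builder_dir/\n' "$restore_root" "$cloud" "$control_plane"
    printf '/etc/swiftlm/%s/%s/builder_dir/\n' "$cloud" "$control_plane"
}
```

Calling swift_builder_paths entry-scale-kvm control-plane-1 prints the two directories used in the example cp command, which makes it easy to confirm them before copying.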
15.2.3.2.3 Point-in-time Cloud Lifecycle Manager Recovery

In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.

Procedure 15.1: Restoring from a Swift or SSH Backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.

  3. Extract the TAR archives for each of the seven locations.

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
        --warning=none --overwrite --directory \
        RESTORE_TARGET -f \
        BACKUP_TARGET.tar.gz

    For example, with a directory such as BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz

    With a file such as BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
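
The two examples above show the difference between directories and files: a directory backup is extracted into the directory itself, while a file backup is extracted into the parent directory of the file. A minimal sketch of that rule follows, assuming that directory targets are written with a trailing slash as in the examples. The function name restore_directory is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the --directory argument for a backup target.
# A target ending in "/" is treated as a directory and extracted into
# itself; any other target is treated as a file and extracted into its
# parent directory.
restore_directory() {
    local target=$1
    case $target in
        */) printf '%s\n' "$target" ;;
        *)  printf '%s/\n' "$(dirname "$target")" ;;
    esac
}
```

For example, restore_directory /etc/ssh/ prints /etc/ssh/ and restore_directory /etc/passwd prints /etc/, matching the two tar commands above.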
15.2.3.2.4 Cloud Lifecycle Manager Disaster Recovery

In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.

To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.5.2 “Installing the SUSE OpenStack Cloud Extension” before proceeding.

Prepare the Cloud Lifecycle Manager following the steps in the Before You Start section of Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”.

15.2.3.2.4.1 Restore from a remote backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve (with scp) the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.

  3. Extract the TAR archives for each of the seven locations.

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
        --warning=none --overwrite --directory \
        RESTORE_TARGET -f \
        BACKUP_TARGET.tar.gz

    For example, with a directory such as BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz

    With a file such as BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
  4. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  5. When the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
15.2.3.2.5 One or Two Controller Node Disaster Recovery

This scenario makes the following assumptions:

  • Your Cloud Lifecycle Manager is still intact and working.

  • One or two of your controller nodes went down, but not the entire cluster.

  • The node needs to be rebuilt from scratch, not simply rebooted.

15.2.3.2.5.1 Steps to recovering one or two controller nodes
  1. Ensure that your node has power and all of the hardware is functioning.

  2. Log in to the Cloud Lifecycle Manager.

  3. Verify that all of the information in your ~/openstack/my_cloud/definition/data/servers.yml file is correct for your controller node. You may need to update the existing information if you had to replace either your entire controller node or just pieces of it.

  4. If you made changes to your servers.yml file then commit those changes to your local git:

    ardana > git add -A
    ardana > git commit -a -m "editing controller information"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Ensure that Cobbler has the correct system information:

    1. If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:

      ardana > sudo cobbler system list
    2. Remove any controller nodes from Cobbler that no longer exist:

      ardana > sudo cobbler system remove --name=<node>
    3. Add the new node into Cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  8. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>
    Note

    If you do not know the <node_name> already, you can get it by using sudo cobbler system list.

    Before proceeding, look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. To prevent loss of data, the configuration processor retains data about removed nodes and keeps their ID numbers from being reallocated. For more information about how this works, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations”.

  9. Run the wipe_disks.yml playbook to ensure the non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

    Important

    The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other situation, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>
  10. Complete the rebuilding of your controller node with the two playbooks below:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>
15.2.3.2.6 Three Control Plane Node Disaster Recovery

In this scenario, all control plane nodes are down and need to be rebuilt or replaced. Restoring from a swift backup is not possible because swift is gone.

15.2.3.2.6.1 Restore from an SSH backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Deploy the control plane nodes, using the values for your control plane node hostnames:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit \
      CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2, \
      CONTROL_PLANE_HOSTNAME3 -e rebuild=True

    For example, if you were using the default values from the example model files, the command would look like this:

    ardana > ansible-playbook -i hosts/verb_hosts site.yml \
    --limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \
    -e rebuild=True
    Note

    Normally, -e rebuild=True is used only on a single control plane node when other controllers are available to pull configuration data from. In this scenario it causes the MariaDB database to be reinitialized, which is the only choice if there are no additional control nodes.

  3. Log in to the Cloud Lifecycle Manager.

  4. Stop MariaDB:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  5. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

  6. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

    ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
    --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
    -f mydb.tar.gz
  7. Verify that the files have been restored on the controller.

    ardana > sudo du -shx /tmp/mysql_restore/*
    16K     /tmp/mysql_restore/aria_log.00000001
    4.0K    /tmp/mysql_restore/aria_log_control
    3.4M    /tmp/mysql_restore/barbican
    8.0K    /tmp/mysql_restore/ceilometer
    4.2M    /tmp/mysql_restore/cinder
    2.9M    /tmp/mysql_restore/designate
    129M    /tmp/mysql_restore/galera.cache
    2.1M    /tmp/mysql_restore/glance
    4.0K    /tmp/mysql_restore/grastate.dat
    4.0K    /tmp/mysql_restore/gvwstate.dat
    2.6M    /tmp/mysql_restore/heat
    752K    /tmp/mysql_restore/horizon
    4.0K    /tmp/mysql_restore/ib_buffer_pool
    76M     /tmp/mysql_restore/ibdata1
    128M    /tmp/mysql_restore/ib_logfile0
    128M    /tmp/mysql_restore/ib_logfile1
    12M     /tmp/mysql_restore/ibtmp1
    16K     /tmp/mysql_restore/innobackup.backup.log
    313M    /tmp/mysql_restore/keystone
    716K    /tmp/mysql_restore/magnum
    12M     /tmp/mysql_restore/mon
    8.3M    /tmp/mysql_restore/monasca_transform
    0       /tmp/mysql_restore/multi-master.info
    11M     /tmp/mysql_restore/mysql
    4.0K    /tmp/mysql_restore/mysql_upgrade_info
    14M     /tmp/mysql_restore/nova
    4.4M    /tmp/mysql_restore/nova_api
    14M     /tmp/mysql_restore/nova_cell0
    3.6M    /tmp/mysql_restore/octavia
    208K    /tmp/mysql_restore/opsconsole
    38M     /tmp/mysql_restore/ovs_neutron
    8.0K    /tmp/mysql_restore/performance_schema
    24K     /tmp/mysql_restore/tc.log
    4.0K    /tmp/mysql_restore/test
    8.0K    /tmp/mysql_restore/winchester
    4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  8. Log back in to the first controller node and move the following files:

    ardana > ssh FIRST_CONTROLLER_NODE
    ardana > sudo su
    root # rm -rf /var/lib/mysql/*
    root # cp -pr /tmp/mysql_restore/* /var/lib/mysql/
  9. Log back in to the Cloud Lifecycle Manager and bootstrap MariaDB:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  10. Verify the status of MariaDB:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml
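
Before the existing database files are removed in the procedure above, it can be worth confirming that the extracted backup is not obviously incomplete. The following sketch checks that a restore directory contains a given set of files. The function name restore_has_files is hypothetical, and which files to require is your choice; the names in the usage note are taken from the example listing above.

```shell
#!/bin/bash
# Hypothetical pre-flight check: succeeds only if the directory given as
# $1 contains every file or directory named in the remaining arguments.
# Missing entries are reported on stderr.
restore_has_files() {
    local dir=$1; shift
    local name missing=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, restore_has_files /tmp/mysql_restore ibdata1 grastate.dat xtrabackup_galera_info returns a non-zero status if any of those entries is absent.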
15.2.3.2.7 swift Rings Recovery

To recover the swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings with the manual swift backup and restore or use the SSH backup.

15.2.3.2.7.1 Restore from the swift deployment backup

See Section 18.6.2.7, “Recovering swift Builder Files”.

15.2.3.2.7.2 Restore from the SSH backup

If you have lost all system disks of all object nodes and the swift proxy nodes are corrupted, you can recover the rings from a copy of the swift rings that was backed up previously. The swift data is still available, provided that the disks used by swift are still accessible.

Recover the rings with these steps.

  1. Log in to a swift proxy node.

  2. Become root:

    ardana > sudo su
  3. Create the temporary directory for your restored files:

    root # mkdir /tmp/swift_builder_dir_restore/
  4. Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

  5. Extract the TAR archive (for example, swring.tar.gz) into the temporary directory.

    ardana > mkdir -p /tmp/swift_builder_dir_restore; sudo tar -z \
    --incremental --extract --ignore-zeros --warning=none --overwrite --directory \
    /tmp/swift_builder_dir_restore/  -f swring.tar.gz

    You now have the swift rings in /tmp/swift_builder_dir_restore/

  6. If the SWF-PRX--first-member is already deployed, copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-PRX--first-member.

  7. Use the following command for the copy on the SWF-PRX--first-member:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
    /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
    /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  8. Then, from the Cloud Lifecycle Manager, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  9. If the SWF-ACC--first-member is not deployed, from the Cloud Lifecycle Manager run these playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts guard-deployment.yml
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>
  10. Copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-ACC[0].

    Create the directories: /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
    /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
    /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  11. From the Cloud Lifecycle Manager, run the ardana-deploy.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml

15.2.4 Unplanned Compute Maintenance

Unplanned maintenance tasks including recovering compute nodes.

15.2.4.1 Recovering a Compute Node

If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, then you need to perform disaster recovery. Here we provide different scenarios and how to resolve them to get your cloud repaired.

Typical scenarios in which you will need to recover a compute node include the following:

  • The node has failed because it has shut down, has a hardware failure, or for another reason.

  • The node is working but the nova-compute process is not responding, thus instances are working but you cannot manage them (for example to delete, reboot, and attach/detach volumes).

  • The node is fully operational but monitoring indicates a potential issue (such as disk errors) that requires downtime to fix.

15.2.4.1.1 What to do if your compute node is down

Compute node has power but is not powered on

If your compute node has power but is not powered on, use these steps to restore the node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the name for your compute node in Cobbler:

    ardana > sudo cobbler system list
  3. Power the node back up with this playbook, specifying the node name from Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node_name>

Compute node is powered on but services are not running on it

If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:

  1. Log in to the Cloud Lifecycle Manager.

  2. Confirm the status of the compute service on the node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>
  3. You can start the compute service on the node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>
15.2.4.1.2 Scenarios involving disk failures on your compute nodes

Your compute nodes should have a minimum of two disks, one that is used for the operating system and one that is used as the data disk. These are defined during the installation of your cloud, in the ~/openstack/my_cloud/definition/data/disks_compute.yml file on the Cloud Lifecycle Manager. The data disk(s) are where the nova-compute service lives. Recovery scenarios will depend on whether one or the other, or both, of these disks experienced failures.

If your operating system disk failed but the data disk(s) are okay

If you have had issues with the physical volume that hosts your operating system, ensure that the physical volume is restored and then use the following steps to restore the operating system:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrator credentials:

    ardana > source ~/service.osrc
  3. Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

    ardana > openstack host list | grep compute
  4. Obtain the status of the nova-compute service on that node:

    ardana > openstack compute service list --host <hostname>
  5. You will likely want to disable provisioning on that node to ensure that nova-scheduler does not attempt to place any additional instances on the node while you are repairing it:

    ardana > openstack compute service set --disable --disable-reason "node is being rebuilt" <hostname> nova-compute
  6. Obtain the status of the instances on the compute node:

    ardana > openstack server list --host <hostname> --all-tenants
  7. Before continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, then you can use the nova evacuate or nova host-evacuate commands to do this. See Section 15.1.3.3, “Live Migration of Instances” for more details on how to do this.

    If your instances are not booted from volumes, you will need to stop the instances using the openstack server stop command. Because the nova-compute service is not running on the node you will not see the instance status change, but the Task State for the instance should change to powering-off.

    ardana > openstack server stop <instance_uuid>

    Check each instance with these commands and confirm that the Task State shows powering-off:

    ardana > openstack server list --host <hostname> --all-tenants
    ardana > openstack server show <instance_uuid>
  8. At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:

    1. Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:

      ardana > sudo cobbler system list
    2. Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
    3. Reimage the compute node with this playbook:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
  9. Once reimaging is complete, use the following playbook to configure the operating system and start up services:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
  10. You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not then use the openstack server start command to bring them to the ACTIVE state:

    ardana > openstack server list --host <hostname> --all-tenants
    ardana > openstack server start <instance_uuid>
  11. Re-enable provisioning:

    ardana > openstack compute service set --enable <hostname> nova-compute
  12. Start any instances that you had stopped previously:

    ardana > openstack server list --host <hostname> --all-tenants
    ardana > openstack server start <instance_uuid>
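
When a node hosts many instances, picking out the ones that still need to be started by eye is error-prone. The sketch below filters the output of openstack server list down to the instances that are not in the ACTIVE state. The function name non_active_instances is hypothetical; -f value and -c are the standard output-formatting options of the openstack client.

```shell
# Hypothetical helper: reads "<uuid> <status>" pairs on stdin and prints the
# UUID of every instance whose status is not ACTIVE. Blank lines are ignored.
non_active_instances() {
    awk 'NF >= 2 && $2 != "ACTIVE" { print $1 }'
}

# Intended use on the Cloud Lifecycle Manager (left commented out so that the
# snippet can be sourced without side effects):
#   openstack server list --host <hostname> --all-tenants -f value -c ID -c Status \
#       | non_active_instances \
#       | while read -r uuid; do openstack server start "$uuid"; done
```

Instances in a state such as ERROR will also be listed, and starting them may fail, so review the list before acting on it.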

If your data disk(s) failed but the operating system disk is okay OR if all drives failed

In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.

After that is complete, use the openstack server rebuild command to respawn your instances, which will also ensure that they receive the same IP address:

ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server rebuild <instance_uuid>
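
Because a rebuild replaces the instance's disk contents, you may prefer to generate the rebuild commands first and review them before anything runs. The following sketch does only that; rebuild_plan is a hypothetical helper name and the file name rebuild.sh is an example.

```shell
# Hypothetical helper: reads instance UUIDs on stdin (one per line) and prints
# one rebuild command per instance. Nothing is executed.
rebuild_plan() {
    while read -r uuid; do
        [ -n "$uuid" ] || continue
        printf 'openstack server rebuild %s\n' "$uuid"
    done
}

# Intended use (commented out so that sourcing the snippet has no side effects):
#   openstack server list --host <hostname> --all-tenants -f value -c ID \
#       | rebuild_plan > rebuild.sh
#   # review rebuild.sh, then run it with: sh rebuild.sh
```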

15.2.5 Unplanned Storage Maintenance

Unplanned maintenance tasks for storage nodes.

15.2.5.1 Unplanned swift Storage Maintenance

Unplanned maintenance tasks for swift storage nodes.

15.2.5.1.1 Recovering a Swift Node

If one or more of your swift Object or PAC nodes has experienced an issue, such as power loss or hardware failure, and you need to perform disaster recovery, the following scenarios describe how to repair your cloud.

Typical scenarios in which you will need to repair a swift object or PAC node include:

  • The node has either shut down or been rebooted.

  • The entire node has failed and needs to be replaced.

  • A disk drive has failed and must be replaced.

15.2.5.1.1.1 What to do if your Swift host has shut down or rebooted

If your swift host has power but is not powered on, you can power it up from the Cloud Lifecycle Manager as follows:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the name for your swift host in Cobbler:

    ardana > sudo cobbler system list
  3. Power the node back up with this playbook, specifying the node name from Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>

Once the node is booted up, swift should start automatically. You can verify this with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml

Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 18.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.
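
If the status playbook is run before the services have finished starting, it may report a failure even though the node is recovering normally. A minimal sketch of a retry wrapper follows. The retry function is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary example values.

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are exhausted.
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Intended use (commented out so that sourcing the snippet has no side effects):
#   cd ~/scratch/ansible/next/ardana/ansible
#   retry 5 60 ansible-playbook -i hosts/verb_hosts swift-status.yml
```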

15.2.5.1.1.2 How to replace your Swift node

If your swift node has irreparable damage and you need to replace the entire node in your environment, see Section 15.1.5.1.5, “Replacing a swift Node” for details on how to do this.

15.2.5.1.1.3 How to replace a hard disk in your Swift node

If you need to do a hard drive replacement in your swift node, see Section 15.1.5.1.6, “Replacing Drives in a swift Node” for details on how to do this.

15.3 Cloud Lifecycle Manager Maintenance Update Procedure

Procedure 15.2: Preparing for Update
  1. Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 16 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”. Alternatives to setting up an SMT server are described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 17 “Software Repository Setup”.

  2. Read the Release Notes for the security and maintenance updates that will be installed.

  3. Have a backup strategy in place. For further information, see Chapter 17, Backup and Restore.

  4. Ensure that you have a known starting state by resolving any unexpected alarms.

  5. Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.

  6. Review steps in Section 15.1.4.1, “Adding a Network Node” and Section 15.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the neutron services are not provided via external SDN controllers.

  7. Before the update, prepare your workloads by consolidating all of your instances onto one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and move the instances from the remaining Compute Nodes to the newly booted ones. Then, update the remaining Compute Nodes.

15.3.1 Performing the Update

Before you proceed, get the status of all your services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml

If the status check returns an error for a specific service, run the SERVICE-reconfigure.yml playbook for that service. Then run the SERVICE-status.yml playbook to check that the issue has been resolved.
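
The reconfigure-then-check pattern can be wrapped in a small function so that the status check only runs if the reconfigure succeeded. This is a sketch under stated assumptions: reconfigure_and_check and the ANSIBLE_PLAYBOOK variable are illustrative names, and the function must be called from the ~/scratch/ansible/next/ardana/ansible directory.

```shell
# Illustrative helper; the names below are assumptions, not part of the product.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# reconfigure_and_check <service>
# Runs <service>-reconfigure.yml followed by <service>-status.yml and stops at
# the first failure.
reconfigure_and_check() {
    service="$1"
    [ -n "$service" ] || { echo "usage: reconfigure_and_check <service>" >&2; return 2; }
    "$ANSIBLE_PLAYBOOK" -i hosts/verb_hosts "${service}-reconfigure.yml" &&
        "$ANSIBLE_PLAYBOOK" -i hosts/verb_hosts "${service}-status.yml"
}
```

For example, reconfigure_and_check nova would run nova-reconfigure.yml and then nova-status.yml.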

Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”.

Note

The described workflow also covers cases in which the deployer node is also provisioned as an active cloud node.

To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 15.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.

Procedure 15.3: Update Instructions
  1. Install all available security and maintenance updates on the deployer using the zypper patch command.

  2. Initialize the Cloud Lifecycle Manager and prepare the update playbooks.

    1. Run the ardana-init initialization script to update the deployer.

    2. Redeploy cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
    3. Run the configuration processor:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    4. Update your deployment directory:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Installation and management of updates can be automated with the following playbooks:

    • ardana-update-pkgs.yml

    • ardana-update.yml

    • ardana-update-status.yml

    • ardana-reboot.yml

  4. Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs playbook on each node.

    ardana > hostnamectl

    Notice that Boot ID: and Kernel: have changed.

  5. By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --limit TARGET_NODE_NAME

    There may be a delay in the playbook output at the following task while updates are pulled from the deployer.

    TASK: [ardana-upgrade-tools | pkg-update | Download and install
    package updates] ***
  6. After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit TARGET_NODE_NAME
  7. To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --limit  TARGET_NODE_NAME \
    -e zypper_update_include_reboot_patches=true

    If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.

    Note

    To update a single package (for example, apply a PTF on a single node or on all nodes), run zypper update PACKAGE.

    To install all package updates, run zypper update.

  8. Update services:

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
    --limit TARGET_NODE_NAME
  9. If indicated by the ardana-update-status.yml playbook, reboot the node.

    There may also be a warning to reboot after running the ardana-update-pkgs.yml playbook.

    This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
    --limit TARGET_NODE_NAME
  10. To recheck pending system reboot status at a later time, run the following commands:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit ardana-cp1-c1-m2 -e update_status_var=system-reboot
  11. The pending system reboot status can be reset by running:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit ardana-cp1-c1-m2 \
    -e update_status_var=system-reboot \
    -e update_status_reset=true
  12. Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.

    Warning

    When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time.

    If multiple nodes are specified on the command line (with --limit), services on those servers will experience outages as the services are shut down and the packages are updated. Before updating a Compute Node (or a group of Compute Nodes), migrate the workload off it. The same applies to Control Nodes: move singleton services off the control plane node that will be updated.

    Important

    Do not reboot all of your controllers at the same time.

  13. When the node comes up after the reboot, run the spark-start.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
  14. Verify that Spark is running on all Control Nodes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
  15. After all nodes have been updated, check the status of all services:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
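
The per-node part of the procedure above can be sketched as a single function that stops at the first failing playbook. This is an illustration only: update_node and the ANSIBLE_PLAYBOOK variable are hypothetical names, the function must be called from ~/scratch/ansible/next/ardana/ansible, and the reboot is deliberately left as a manual step because ardana-reboot.yml interrupts cloud services.

```shell
# Illustrative helper; the names below are assumptions, not part of the product.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# update_node <target_node_name>
# 1. install updates that do not require a reboot
# 2. report the status of the remaining tasks
# 3. install updates that do require a reboot
# 4. update the services
update_node() {
    node="$1"
    [ -n "$node" ] || { echo "usage: update_node <target_node_name>" >&2; return 2; }
    "$ANSIBLE_PLAYBOOK" -i hosts/verb_hosts ardana-update-pkgs.yml --limit "$node" &&
    "$ANSIBLE_PLAYBOOK" -i hosts/verb_hosts ardana-update-status.yml --limit "$node" &&
    "$ANSIBLE_PLAYBOOK" -i hosts/verb_hosts ardana-update-pkgs.yml --limit "$node" \
        -e zypper_update_include_reboot_patches=true &&
    "$ANSIBLE_PLAYBOOK" -i hosts/verb_hosts ardana-update.yml --limit "$node"
}
```

Before calling update_node for a node, prepare it as described earlier in this section (for example by evacuating a Compute Node). Afterwards, run ardana-reboot.yml for that node if a reboot was indicated.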

15.3.2 Summary of the Update Playbooks

ardana-update-pkgs.yml

Top-level playbook that automates the installation of package updates on a single node. It also works for multiple nodes if the single-node restriction is overridden, either by setting the SKIP_SINGLE_HOST_CHECKS environment variable or by running ardana-update-pkgs.yml with -e skip_single_host_checks=true.

Provide the following -e options to modify default behavior:

  • zypper_update_method (default: patch)

    • patch installs all patches for the system. Patches are intended for specific bug and security fixes.

    • update installs all packages that have a higher version number than the installed packages.

    • dist-upgrade replaces each package installed with the version from the repository and deletes packages not available in the repositories.

  • zypper_update_repositories (default: all) restricts the list of repositories used

  • zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks if packages are correctly signed.

  • zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third party licenses.

  • zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).
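
To illustrate how these options combine on the command line, the sketch below prints (but does not run) an ardana-update-pkgs.yml invocation and rejects values that are not listed above. The function name update_pkgs_cmd is hypothetical.

```shell
# update_pkgs_cmd <node> [method] [include_reboot_patches]
# Prints an ardana-update-pkgs.yml command line; nothing is executed.
# Defaults mirror the documented defaults: method=patch, reboot patches=false.
update_pkgs_cmd() {
    node="$1"
    method="${2:-patch}"
    reboot="${3:-false}"
    case "$method" in
        patch|update|dist-upgrade) ;;
        *) echo "unknown zypper_update_method: $method" >&2; return 1 ;;
    esac
    case "$reboot" in
        true|false) ;;
        *) echo "include_reboot_patches must be true or false" >&2; return 1 ;;
    esac
    printf 'ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit %s -e zypper_update_method=%s -e zypper_update_include_reboot_patches=%s\n' \
        "$node" "$method" "$reboot"
}
```

For example, update_pkgs_cmd TARGET_NODE_NAME update true prints a command that installs all newer packages, including those that require a reboot.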

ardana-update.yml

Top-level playbook that automates the update of all the services. It runs on all nodes by default, or can be limited to a single node by adding --limit nodename.

ardana-reboot.yml

Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.

ardana-update-status.yml

This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.

15.4 Upgrading Cloud Lifecycle Manager 8 to Cloud Lifecycle Manager 9

Before undertaking the upgrade from SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you need to ensure that your existing SUSE OpenStack Cloud 8 Cloud Lifecycle Manager installation is up to date by following the maintenance update procedure at https://documentation.suse.com/hpe-helion/8/html/hpe-helion-openstack-clm-all/system-maintenance.html#maintenance-update.

Ensure you review the following resources:

To confirm that all nodes have been successfully updated with no pending actions, run the ardana-update-status.yml playbook on the Cloud Lifecycle Manager deployer node as follows:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml
Note

Ensure that all nodes have been updated, and that there are no pending update actions remaining to be completed. In particular, ensure that any nodes that need to be rebooted have been, using the documented reboot procedure.

Procedure 15.4: Running the Pre-Upgrade Validation Checks to Ensure that your Cloud is Ready for Upgrade
  • Once all nodes have been successfully updated, and there are no pending update actions remaining, you should be able to run the ardana-pre-upgrade-validations.sh script, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ./ardana-pre-upgrade-validations.sh
    ~/scratch/ansible/next/ardana/ansible ~/scratch/ansible/next/ardana/ansible
    
    PLAY [Initialize an empty list of msgs] ***************************************
    
    TASK: [set_fact ] *************************************************************
    ok: [localhost]
    ...
    
    PLAY RECAP ********************************************************************
    ...
    localhost                  : ok=8    changed=5    unreachable=0    failed=0
    
    msg: Please refer to /var/log/ardana-pre-upgrade-validations.log for the results of this run. Ensure that any messages in the file that have the words FAIL or WARN are resolved.

    The last line of output from the ardana-pre-upgrade-validations.sh script will tell you the name of its log file—in this case, /var/log/ardana-pre-upgrade-validations.log. If you look at the log file, you will see content similar to the following:

    ardana > sudo cat /var/log/ardana-pre-upgrade-validations.log
    ardana-cp-dbmqsw-m1*************************************************************
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    
    ardana-cp-dbmqsw-m2*************************************************************
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    
    ardana-cp-dbmqsw-m3*************************************************************
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    
    ardana-cp-mml-m1****************************************************************
    SUCCESS: Keystone V2 ==> V3 API config changes detected.
    ardana-cp-mml-m2****************************************************************
    SUCCESS: Keystone V2 ==> V3 API config changes detected.
    ardana-cp-mml-m3****************************************************************
    SUCCESS: Keystone V2 ==> V3 API config changes detected.
    localhost***********************************************************************

    The report states the following:

    SUCCESS: Keystone V2 ==> V3 API config changes detected.

    This check confirms that your cloud has been updated with the necessary changes such that all services will be using Keystone V3 API. This means that there should be minimal interruption of service during the upgrade. This is important because the Keystone V2 API has been removed in SUSE OpenStack Cloud 9.

    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than the SUSE Linux Enterprise 12 SP4 recommendation of 512. Some recommended XFS data integrity features may not be available after upgrade.

    This check will only report something if you have local swift configured and it is formatted with the SUSE Linux Enterprise 12 SP3 default XFS inode size of 256. In SUSE Linux Enterprise 12 SP4, the default XFS inode size for a newly-formatted XFS file system has been increased to 512, to allow room for enabling some additional XFS data-integrity features by default.

Note

There will be no loss of swift functionality after the upgrade. The difference is that some additional XFS features will not be available on file systems that were formatted under SUSE Linux Enterprise 12 SP3 or earlier. These XFS features aid in the detection of, and recovery from, data corruption. They are enabled by default for XFS file systems formatted under SUSE Linux Enterprise 12 SP4.
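
Because the validation log can be long, it can help to extract only the messages that must be resolved. The following sketch prints every line containing FAIL or WARN and signals through its exit status whether any were found. The function name blocking_messages is hypothetical, and the sketch matches those words anywhere in a line because the exact layout of such lines is not shown in the sample log above.

```shell
# Reads a pre-upgrade validation log on stdin, prints every line that contains
# FAIL or WARN, and returns non-zero if at least one such line was found.
blocking_messages() {
    if grep -E 'FAIL|WARN'; then
        return 1
    fi
    return 0
}

# Intended use (commented out so that sourcing the snippet has no side effects):
#   sudo cat /var/log/ardana-pre-upgrade-validations.log | blocking_messages \
#       && echo "no FAIL or WARN messages found"
```

NOTE and SUCCESS lines, such as those in the sample log, are not reported.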

Procedure 15.5: Additional Pre-Upgrade Checks That Should Be Performed

In addition to the automated upgrade checks above, there are some checks that should be performed manually.

  1. For each network interface device specified in the input model under ~/openstack/my_cloud/definition, ensure that there is only one untagged VLAN. The SUSE OpenStack Cloud 9 Cloud Lifecycle Manager configuration processor will fail with an error if it detects more than one untagged VLAN on an interface during the upgrade, so address any such configuration before starting the upgrade process.

  2. If the deployer node is not a standalone system, but is instead co-located with the DB services, this can lead to potentially longer service disruptions during the upgrade process. To determine if this is the case, check if the deployer node (OPS-LM--first-member) is a member of the database nodes (FND-MDB). You can do this with the following command:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible -i hosts/verb_hosts 'FND-MDB:&OPS-LM--first-member' --list-hosts

    If the output is:

           No hosts matched

    Then the deployer node is not co-located with the database nodes. Otherwise, if the command reports a hostname, then there may be additional interruptions to the database services during the upgrade.

  3. Similarly, if the deployer is co-located with the database services, and you are also trying to run a local SMT service on the deployer node, you will run into issues trying to configure the SMT to enable and mirror the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.

    In such cases, it is recommended that you run the SMT services on a different node and NFS-import the /srv/www/htdocs/repo directory onto the deployer node, instead of trying to run the SMT services locally.
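
The co-location check in step 2 can be scripted by interpreting the output of the --list-hosts command. The sketch below is an assumption-laden illustration: deployer_is_colocated is a hypothetical name, and besides the "No hosts matched" message shown above it also ignores the "hosts (N):" header line that newer Ansible releases print.

```shell
# Reads the output of "ansible ... --list-hosts" on stdin.
# Returns 0 if a host name is reported (deployer is co-located with the
# database nodes) and 1 if no host matched.
deployer_is_colocated() {
    # Drop blank lines, the "No hosts matched" message and any "hosts (N):"
    # header; whatever remains is treated as a matched host name.
    remaining=$(grep -v \
        -e '^[[:space:]]*$' \
        -e 'No hosts matched' \
        -e '^[[:space:]]*hosts ([0-9][0-9]*):' || true)
    [ -n "$remaining" ]
}

# Intended use (commented out so that sourcing the snippet has no side effects):
#   cd ~/scratch/ansible/next/ardana/ansible
#   if ansible -i hosts/verb_hosts 'FND-MDB:&OPS-LM--first-member' --list-hosts \
#       | deployer_is_colocated; then
#       echo "deployer is co-located with the database nodes"
#   fi
```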

Note: Back Up the Cloud Lifecycle Manager Configuration Settings

The integrated backup solution in SUSE OpenStack Cloud 8 Cloud Lifecycle Manager, freezer, is no longer available in SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. Therefore, we recommend doing a manual backup to a server that is not a member of the cloud, as per Chapter 17, Backup and Restore.

15.4.1 Migrating the Deployer Node Packages

The upgrade process first migrates the SUSE OpenStack Cloud 8 Cloud Lifecycle Manager deployer node to SUSE Linux Enterprise 12 SP4 and the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager packages.

Important

If the deployer node is not a dedicated node, but is instead a member of one of the cloud control planes, then some services may restart with the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM versions of the software during the migration. This may mean that:

  • Some services fail to restart. This will be resolved when the appropriate SUSE OpenStack Cloud 9 configuration changes are applied by running the ardana-upgrade.yml playbook, later during the upgrade process.

  • Other services may log excessive warnings about connectivity issues and backwards-compatibility warnings. This will be resolved when the relevant services are upgraded during the ardana-upgrade.yml playbook run.

In order to upgrade the deployer node to be based on SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you first need to migrate the system to SUSE Linux Enterprise 12 SP4 with the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager product installed.

The process for migrating the deployer node differs somewhat, depending on whether your deployer node is registered with the SUSE Customer Center (or an SMT mirror), versus using locally-maintained repositories available at the relevant locations.

If your deployer node is registered with the SUSE Customer Center or an SMT, the migration process requires the zypper-migration-plugin package to be installed.

Procedure 15.6: Migrating an SCC/SMT Registered Deployer Node
  1. If you are using an SMT server to mirror the relevant repositories, then you need to enable mirroring of the relevant repositories. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 16 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”, Section 16.3 “Setting up Repository Mirroring on the SMT Server” for more information.

    Ensure that the mirroring process has completed before proceeding.

  2. Ensure that the zypper-migration-plugin package is installed; if not, install it:

    ardana > sudo zypper install zypper-migration-plugin
    Refreshing service 'SMT-http_smt_example_com'.
    Loading repository data...
    Reading installed packages...
    'zypper-migration-plugin' is already installed.
    No update candidate for 'zypper-migration-plugin-0.10-12.4.noarch'. The highest available version is already installed.
    Resolving package dependencies...
    
    Nothing to do.
  3. De-register the SUSE Linux Enterprise Server LTSS 12 SP3 x86_64 extension (if enabled):

    ardana > sudo SUSEConnect --status-text
    Installed Products:
    ------------------------------------------
    
      SUSE Linux Enterprise Server 12 SP3 LTSS
      (SLES-LTSS/12.3/x86_64)
    
      Registered
    
    ------------------------------------------
    
      SUSE Linux Enterprise Server 12 SP3
      (SLES/12.3/x86_64)
    
      Registered
    
    ------------------------------------------
    
      SUSE OpenStack Cloud 8
      (suse-openstack-cloud/8/x86_64)
    
      Registered
    
    ------------------------------------------
    
    
    ardana > sudo SUSEConnect -d -p SLES-LTSS/12.3/x86_64
    Deregistering system from registration proxy https://smt.example.com/
    
    Deactivating SLES-LTSS 12.3 x86_64 ...
    -> Refreshing service ...
    -> Removing release package ...
    ardana > sudo SUSEConnect --status-text
    Installed Products:
    ------------------------------------------
    
      SUSE Linux Enterprise Server 12 SP3
      (SLES/12.3/x86_64)
    
      Registered
    
    ------------------------------------------
    
      SUSE OpenStack Cloud 8
      (suse-openstack-cloud/8/x86_64)
    
      Registered
    
    ------------------------------------------
  4. Disable any other SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories. The zypper migration process should detect and disable most of these automatically, but in some cases it may not catch all of them, which can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the /srv/www/suse-12.3 directory or the SUSE-12-4 alias under http://localhost:79/, you could use the following commands:

    ardana > zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2
    PTF
    SLES12-SP3-LTSS-Updates
    SLES12-SP3-Pool
    SLES12-SP3-Updates
    SUSE-OpenStack-Cloud-8-Pool
    SUSE-OpenStack-Cloud-8-Updates
    ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done
    Repository 'PTF' has been successfully disabled.
    Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled.
    Repository 'SLES12-SP3-Pool' has been successfully disabled.
    Repository 'SLES12-SP3-Updates' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.
  5. Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3 (a new one, based on SUSE Linux Enterprise 12 SP4, will be created during the upgrade process):

    ardana > zypper repos | grep PTF
     2 | PTF                                                               | PTF                                      | No      | (r ) Yes  | Yes
    ardana > sudo zypper removerepo PTF
    Removing repository 'PTF' ..............................................................................................[done]
    Repository 'PTF' has been removed.
  6. Remove the Cloud media repository (if defined):

    ardana > zypper repos | grep '[|] Cloud '
     1 | Cloud                          | SUSE OpenStack Cloud 8 DVD #1  | Yes     | (r ) Yes  | No
    ardana > sudo zypper removerepo Cloud
    Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done]
    Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.
  7. Run the zypper migration command, which should offer a single choice: namely, to upgrade to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. You need to accept the offered choice, then answer yes to any prompts to disable obsoleted repositories. At that point, the zypper migration command will run zypper dist-upgrade, which will prompt you to agree with the proposed package changes. Finally, you will need to agree to any new licenses. After this, the package upgrade of the deployer node will proceed. The output of the running zypper migration should look something like the following:

    ardana > sudo zypper migration
    
    Executing 'zypper  refresh'
    
    Repository 'SLES12-SP3-Pool' is up to date.
    Repository 'SLES12-SP3-Updates' is up to date.
    Repository 'SLES12-SP3-Pool' is up to date.
    Repository 'SLES12-SP3-Updates' is up to date.
    Repository 'SUSE-OpenStack-Cloud-8-Pool' is up to date.
    Repository 'SUSE-OpenStack-Cloud-8-Updates' is up to date.
    Repository 'OpenStack-Cloud-8-Pool' is up to date.
    Repository 'OpenStack-Cloud-8-Updates' is up to date.
    All repositories have been refreshed.
    
    Executing 'zypper  --no-refresh patch-check --updatestack-only'
    
    Loading repository data...
    Reading installed packages...
    
    0 patches needed (0 security patches)
    
    Available migrations:
    
        1 | SUSE Linux Enterprise Server 12 SP4 x86_64
            SUSE OpenStack Cloud 9 x86_64
    
    
    [num/q]: 1
    
    Executing 'snapper create --type pre --cleanup-algorithm=number --print-number --userdata important=yes --description 'before online migration''
    
    The config 'root' does not exist. Likely snapper is not configured.
    See 'man snapper' for further instructions.
    Upgrading product SUSE Linux Enterprise Server 12 SP4 x86_64.
    Found obsolete repository SLES12-SP3-Updates
    Disable obsolete repository SLES12-SP3-Updates [y/n] (y): y
    ... disabling.
    Found obsolete repository SLES12-SP3-Pool
    Disable obsolete repository SLES12-SP3-Pool [y/n] (y): y
    ... disabling.
    Upgrading product SUSE OpenStack Cloud 9 x86_64.
    Found obsolete repository OpenStack-Cloud-8-Pool
    Disable obsolete repository OpenStack-Cloud-8-Pool [y/n] (y): y
    ... disabling.
    
    Executing 'zypper --releasever 12.4 ref -f'
    
    Warning: Enforced setting: $releasever=12.4
    Forcing raw metadata refresh
    Retrieving repository 'SLES12-SP4-Pool' metadata .......................................................................[done]
    Forcing building of repository cache
    Building repository 'SLES12-SP4-Pool' cache ............................................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'SLES12-SP4-Updates' metadata ....................................................................[done]
    Forcing building of repository cache
    Building repository 'SLES12-SP4-Updates' cache .........................................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'SUSE-OpenStack-Cloud-9-Pool' metadata ...........................................................[done]
    Forcing building of repository cache
    Building repository 'SUSE-OpenStack-Cloud-9-Pool' cache ................................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'SUSE-OpenStack-Cloud-9-Updates' metadata ........................................................[done]
    Forcing building of repository cache
    Building repository 'SUSE-OpenStack-Cloud-9-Updates' cache .............................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'OpenStack-Cloud-8-Updates' metadata .............................................................[done]
    Forcing building of repository cache
    Building repository 'OpenStack-Cloud-8-Updates' cache ..................................................................[done]
    All repositories have been refreshed.
    
    Executing 'zypper --releasever 12.4  --no-refresh  dist-upgrade --no-allow-vendor-change '
    
    Warning: Enforced setting: $releasever=12.4
    Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    ...
    
    525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch.
    Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    ...
        dracut: *** Generating early-microcode cpio image ***
        dracut: *** Constructing GenuineIntel.bin ****
        dracut: *** Store current command line parameters ***
        dracut: Stored kernel commandline:
        dracut:  rd.lvm.lv=ardana-vg/root
        dracut:  root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered
        dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' ***
        dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done ***
    
    Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script:
        Refresh script btrfs-scrub.sh for monthly
        Refresh script btrfs-defrag.sh for none
        Refresh script btrfs-balance.sh for weekly
        Refresh script btrfs-trim.sh for none
    
    There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.
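Steps 5 and 6 above remove a repository only if it is defined. The following is a minimal sketch of that check, assuming the zypper repos table layout shown above, with the alias in the second |-separated column. The function name repo_alias_defined is hypothetical and is not part of zypper or the Cloud Lifecycle Manager.

```shell
# Hypothetical helper: succeed if the alias given as $1 appears in the second
# column of `zypper repos` table output supplied on stdin.
repo_alias_defined() {
    awk -F'|' -v want="$1" '
        {
            alias = $2
            gsub(/^[ \t]+|[ \t]+$/, "", alias)   # trim the padding around the alias
        }
        alias == want { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example use on the deployer node (removes the repository only if present):
#   if zypper repos | repo_alias_defined Cloud; then
#       sudo zypper removerepo Cloud
#   fi
```

Matching on the alias column avoids false positives from the Name column, where a different repository could contain the word Cloud.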
Procedure 15.7: Migrating a Deployer Node with Locally-Managed Repositories

In this configuration, you need to manually migrate the system using zypper dist-upgrade, according to the following steps:

  1. Disable any SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud 8 Cloud Lifecycle Manager-related repositories. Leaving the SUSE Linux Enterprise 12 SP3 and/or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories enabled can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the /srv/www/suse-12.3 directory, or the SUSE-12-4 alias under http://localhost:79/, use the following commands:

    ardana > zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2
    PTF
    SLES12-SP3-LTSS-Updates
    SLES12-SP3-Pool
    SLES12-SP3-Updates
    SUSE-OpenStack-Cloud-8-Pool
    SUSE-OpenStack-Cloud-8-Updates
    ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done
    Repository 'PTF' has been successfully disabled.
    Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled.
    Repository 'SLES12-SP3-Pool' has been successfully disabled.
    Repository 'SLES12-SP3-Updates' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.
    Note

    The SLES12-SP3-LTSS-Updates repository should only be present if you have purchased the optional SUSE Linux Enterprise 12 SP3 LTSS support. Whether or not it is configured will not impact the upgrade process.

  2. Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3. A new one based on SUSE Linux Enterprise 12 SP4 will be created during the upgrade process.

    ardana > zypper repos | grep PTF
     2 | PTF                                                               | PTF                                      | Yes     | (r ) Yes  | Yes
    ardana > sudo zypper removerepo PTF
    Removing repository 'PTF' ..............................................................................................[done]
    Repository 'PTF' has been removed.
  3. Remove the Cloud media repository if defined.

    ardana > zypper repos | grep '[|] Cloud '
     1 | Cloud                          | SUSE OpenStack Cloud 8 DVD #1  | Yes     | (r ) Yes  | No
    ardana > sudo zypper removerepo Cloud
    Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done]
    Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.
  4. Ensure the deployer node has access to the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM repositories, as documented in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 17 “Software Repository Setup”, paying attention to the non-SMT based repository setup. When you run zypper repos --show-enabled-only, the output should look similar to the following:

    ardana > zypper repos --show-enabled-only
    #  | Alias                          | Name                           | Enabled | GPG Check | Refresh
    ---+--------------------------------+--------------------------------+---------+-----------+--------
     1 | Cloud                          | SUSE OpenStack Cloud 9 DVD #1  | Yes     | (r ) Yes  | No
     7 | SLES12-SP4-Pool                | SLES12-SP4-Pool                | Yes     | (r ) Yes  | No
     8 | SLES12-SP4-Updates             | SLES12-SP4-Updates             | Yes     | (r ) Yes  | Yes
     9 | SUSE-OpenStack-Cloud-9-Pool    | SUSE-OpenStack-Cloud-9-Pool    | Yes     | (r ) Yes  | No
    10 | SUSE-OpenStack-Cloud-9-Updates | SUSE-OpenStack-Cloud-9-Updates | Yes     | (r ) Yes  | Yes
    Note

    The Cloud repository above is optional. Its content is equivalent to the SUSE-OpenStack-Cloud-9-Pool repository.

  5. Run the zypper dist-upgrade command to upgrade the deployer node:

    ardana > sudo zypper dist-upgrade
    
    Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    ...
    
    525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch.
    Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    ...
        dracut: *** Generating early-microcode cpio image ***
        dracut: *** Constructing GenuineIntel.bin ****
        dracut: *** Store current command line parameters ***
        dracut: Stored kernel commandline:
        dracut:  rd.lvm.lv=ardana-vg/root
        dracut:  root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered
        dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' ***
        dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done ***
    
    Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script:
        Refresh script btrfs-scrub.sh for monthly
        Refresh script btrfs-defrag.sh for none
        Refresh script btrfs-balance.sh for weekly
        Refresh script btrfs-trim.sh for none
    
    There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.
    Note

    You may need to run the zypper dist-upgrade command more than once, if it determines that it needs to update the zypper infrastructure on your system to be able to successfully dist-upgrade the node; the command will tell you if you need to run it again.
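Before running zypper dist-upgrade in step 5, it can help to confirm that steps 1 to 3 left no SP3 or Cloud 8 repositories enabled. The following is a minimal sketch, assuming the alias naming seen in the sample output above (aliases containing SP3 or Cloud-8). The function name leftover_old_repos is hypothetical, and the patterns may need adjusting for locally chosen aliases.

```shell
# Hypothetical helper: print enabled repository aliases that still look
# SP3- or Cloud 8-based. Reads `zypper repos --show-enabled-only` on stdin.
# The alias patterns are assumptions based on the sample output in this guide.
leftover_old_repos() {
    awk -F'|' '
        NF >= 2 {
            alias = $2
            gsub(/^[ \t]+|[ \t]+$/, "", alias)   # trim column padding
            if (alias ~ /SP3/ || alias ~ /Cloud-8/) print alias
        }
    '
}

# Example use on the deployer node:
#   zypper repos --show-enabled-only | leftover_old_repos
```

If the command prints nothing, no enabled alias matched either pattern.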

15.4.2 Upgrading the Deployer Node Configuration Settings

Now that the deployer node packages have been migrated to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, the configuration settings must be updated to match SUSE OpenStack Cloud 9 Cloud Lifecycle Manager.

The first step is to run the ardana-init command. This will:

  • Add the PTF repository, creating it if needed.

  • Optionally add appropriate local repository references for any SMT-provided SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.

  • Upgrade the deployer account ~/openstack area to be based upon SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources.

    • This will import the new SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible code into the Git repository on the Ardana branch, and then rebase the customer site branch on top of the updated Ardana branch.

    • Follow the directions to resolve any Git merge conflicts that may arise due to local changes that may have been made on the site branch:

      ardana > ardana-init
      ...
       To continue installation copy your cloud layout to:
           /var/lib/ardana/openstack/my_cloud/definition
      
       Then execute the installation playbooks:
           cd /var/lib/ardana/openstack/ardana/ansible
           git add -A
           git commit -m 'My config'
           ansible-playbook -i hosts/localhost cobbler-deploy.yml
           ansible-playbook -i hosts/localhost bm-reimage.yml
           ansible-playbook -i hosts/localhost config-processor-run.yml
           ansible-playbook -i hosts/localhost ready-deployment.yml
           cd /var/lib/ardana/scratch/ansible/next/ardana/ansible
           ansible-playbook -i hosts/verb_hosts site.yml
      
       If you prefer to use the UI to install the product, you can
       do either of the following:
           - If you are running a browser on this machine, you can point
             your browser to http://localhost:9085 to start the install
             via the UI.
           - If you are running the browser on a remote machine, you will
             need to create an ssh tunnel to access the UI.  Please refer
             to the Ardana installation documentation for further details.
Note

As we are upgrading to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, we do not need to run the suggested bm-reimage.yml playbook.

Procedure 15.8: Updating the Bare-Metal Provisioning Configuration

If you were previously using the cobbler-based integrated provisioning solution, then you will need to perform the following steps to import the SUSE Linux Enterprise 12 SP4 ISO and update the default provisioning distribution:

  1. Ensure there is a copy of the SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso, named sles12sp4.iso, available in the /var/lib/ardana directory.

  2. Ensure that any distribution entries in servers.yml (or whichever file holds the server node definitions) under ~/openstack/my_cloud/definition are updated to specify sles12sp4 if they are currently using sles12sp3.

    Note

    The default distribution will now be sles12sp4, so if there are no specific distribution entries specified for the servers, then no change will be required.

    If you have made any changes to the ~/openstack/my_cloud/definition files, you will need to commit those changes, as follows:

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Update sles12sp3 distro entries to sles12sp4"
  3. Run the cobbler-deploy.yml playbook to import the SUSE Linux Enterprise 12 SP4 distribution as the new default distribution:

    ardana > cd ~/openstack/ardana/ansible
     ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
     Enter the password that will be used to access provisioned nodes:
     confirm Enter the password that will be used to access provisioned nodes:
    
     PLAY [localhost] **************************************************************
    
     GATHERING FACTS ***************************************************************
     ok: [localhost]
    
     TASK: [pbstart.yml pb_start_playbook] *****************************************
     ok: [localhost] => {
         "msg": "Playbook started - cobbler-deploy.yml"
     }
    
     msg: Playbook started - cobbler-deploy.yml
    
     ...
    
     PLAY [localhost] **************************************************************
    
     TASK: [pbfinish.yml pb_finish_playbook] ***************************************
     ok: [localhost] => {
         "msg": "Playbook finished - cobbler-deploy.yml"
     }
    
     msg: Playbook finished - cobbler-deploy.yml
    
     PLAY RECAP ********************************************************************
     localhost                  : ok=92   changed=45   unreachable=0    failed=0
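Step 2 of the procedure above asks you to find any server definitions that still name sles12sp3. The following is a minimal sketch of that search, assuming such entries contain the literal string sles12sp3 somewhere under the definition directory. The function name find_sp3_distro_entries is hypothetical.

```shell
# Hypothetical helper: list "file:line:text" for every sles12sp3 reference
# under the given directory (default: the definition path used in this guide).
# Returns non-zero when nothing is found, which is grep's own behaviour.
find_sp3_distro_entries() {
    dir="${1:-$HOME/openstack/my_cloud/definition}"
    grep -rn 'sles12sp3' "$dir"
}

# Example use on the deployer node:
#   find_sp3_distro_entries || echo "no sles12sp3 entries found"
```

The sketch only reports matches. Editing the files by hand keeps the change reviewable with git diff before it is committed.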

You are now ready to upgrade the input model to be compatible.

Procedure 15.9: Upgrading the Cloud Input Model

At this point, there are some mandatory changes that must be made to the existing input model to permit the upgrade to proceed. These mandatory changes represent:

  • The removal of previously-deprecated service components;

  • The dropping of service components that are no longer supported;

  • That there can be only one untagged VLAN per network interface;

  • That there must be a MANAGEMENT network group.

There are also some service components that have been made redundant and have no effect. These should be removed to quieten the associated config-processor-run.yml warnings.

For example, if you run the config-processor-run.yml playbook from the ~/openstack/ardana/ansible directory before making the necessary input model changes, it should fail with errors similar to those shown below, unless your input model does not deploy the problematic service components:

ardana > cd ~/openstack/ardana/ansible
 ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
 Enter encryption key (press return for none):
 confirm Enter encryption key (press return for none):
 To change encryption key enter new key (press return for none):
 confirm To change encryption key enter new key (press return for none):

 PLAY [localhost] **************************************************************

 GATHERING FACTS ***************************************************************
 ok: [localhost]

 ...

             "################################################################################",
             "# The configuration processor failed.  ",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'designate-pool-manager' has been deprecated and will be replaced by 'designate-worker'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'manila-share' service component is deprecated. The 'manila-share' service component can be removed as manila share service will be deployed where manila-api is specified. This is not a deprecation for openstack-manila-share but just an entry deprecation in input model.",
             "",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'designate-zone-manager' has been deprecated and will be replaced by 'designate-producer'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'glance-registry' has been deprectated and is no longer deployed. Please update you input model to remove any 'glance-registry' service component specifications to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:mml: 'ceilometer-api' is no longer used by Ardana and will not be deployed. Please update your input model to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:sles-compute: 'neutron-lbaasv2-agent' has been deprecated and replaced by 'octavia' and will not be deployed in a future release. Please update your input model to remove this warning.",
             "",
             "#   control-planes-2.0        ERR: cp:common-service-components: Undefined component 'freezer-agent'",
             "#   control-planes-2.0        ERR: cp:openstack-core: Undefined component 'nova-console-auth'",
             "#   control-planes-2.0        ERR: cp:openstack-core: Undefined component 'heat-api-cloudwatch'",
             "#   control-planes-2.0        ERR: cp:mml: Undefined component 'freezer-api'",
             "################################################################################"
         ]
     }
 }

 TASK: [debug var=config_processor_result.stderr] ******************************
 ok: [localhost] => {
     "var": {
         "config_processor_result.stderr": "/usr/lib/python2.7/site-packages/ardana_configurationprocessor/cp/model/YamlConfigFile.py:95: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n  self._contents = yaml.load(''.join(lines))"
     }
 }

 TASK: [fail msg="Configuration processor run failed, see log output above for details"] ***
 failed: [localhost] => {"failed": true}
 msg: Configuration processor run failed, see log output above for details

 msg: Configuration processor run failed, see log output above for details

 FATAL: all hosts have already failed -- aborting

 PLAY RECAP ********************************************************************
            to retry, use: --limit @/var/lib/ardana/config-processor-run.retry

 localhost                  : ok=8    changed=5    unreachable=0    failed=1
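In a long playbook log, the ERR lines that actually cause the failure can be hard to spot among the warnings. The following is a minimal sketch that extracts the component names from messages of the form Undefined component '<name>', as shown in the output above. The function name list_undefined_components and the log file name cp-run.log are hypothetical.

```shell
# Hypothetical helper: print the unique component names reported as
# "Undefined component '<name>'" in config-processor output read from stdin.
list_undefined_components() {
    sed -n "s/.*Undefined component '\([^']*\)'.*/\1/p" | LC_ALL=C sort -u
}

# Example use, assuming the playbook output was saved to cp-run.log:
#   list_undefined_components < cp-run.log
```

Each name printed corresponds to a service component entry that has to be removed from the input model.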

To resolve any errors and warnings like those shown above, you will need to perform the following actions:

  1. Remove any service component entries that are no longer valid from the control_plane.yml (or whichever file holds the control-plane definitions) under ~/openstack/my_cloud/definition. This means that you have to comment out (or delete) any lines for the following service components, which are no longer available:

    • freezer-agent

    • freezer-api

    • heat-api-cloudwatch

    • nova-console-auth

    Note

    This should resolve the errors that cause the config-processor-run.yml playbook to fail.

  2. Similarly, remove any service components that are redundant and no longer required. This means that you should comment out (or delete) any lines for the following service components:

    • ceilometer-api

    • glance-registry

    • manila-share

    • neutron-lbaasv2-agent

    Note

    This should resolve most of the warnings reported by the config-processor-run.yml playbook.

    Important

    If you have deployed the designate service components (designate-pool-manager and designate-zone-manager) in your cloud, you will see warnings like those shown above, indicating that these service components have been deprecated.

    You can switch to using the newer designate-worker and designate-producer service components, which will quieten these deprecation warnings produced by the config-processor-run.yml playbook run.

    However, this procedure should be performed after the upgrade has completed, as outlined in Section 15.4.5, “Post-Upgrade Tasks”, below.

  3. Once you have made the necessary changes to your input model, if you run git diff under the ~/openstack/my_cloud/definition directory, you should see output similar to the following:

    ardana > cd ~/openstack/my_cloud/definition
     ardana > git diff
     diff --git a/my_cloud/definition/data/control_plane.yml b/my_cloud/definition/data/control_plane.yml
     index f7cfd84..2c1a73c 100644
     --- a/my_cloud/definition/data/control_plane.yml
     +++ b/my_cloud/definition/data/control_plane.yml
     @@ -32,7 +32,6 @@
              - NEUTRON-CONFIG-CP1
            common-service-components:
              - lifecycle-manager-target
     -        - freezer-agent
              - stunnel
              - monasca-agent
              - logging-rotate
     @@ -118,12 +117,10 @@
                  - cinder-volume
                  - cinder-backup
                  - glance-api
     -            - glance-registry
                  - nova-api
                  - nova-placement-api
                  - nova-scheduler
                  - nova-conductor
     -            - nova-console-auth
                  - nova-novncproxy
                  - neutron-server
                  - neutron-ml2-plugin
     @@ -137,7 +134,6 @@
                  - horizon
                  - heat-api
                  - heat-api-cfn
     -            - heat-api-cloudwatch
                  - heat-engine
                  - ops-console-web
                  - barbican-api
     @@ -151,7 +147,6 @@
                  - magnum-api
                  - magnum-conductor
                  - manila-api
     -            - manila-share
    
              - name: mml
                cluster-prefix: mml
     @@ -164,9 +159,7 @@
    
                  # freezer-api shares elastic-search with logging-server
                  # so must be co-located with it
     -            - freezer-api
    
     -            - ceilometer-api
                  - ceilometer-polling
                  - ceilometer-agent-notification
                  - ceilometer-common
     @@ -194,4 +187,3 @@
                  - neutron-l3-agent
                  - neutron-metadata-agent
                  - neutron-openvswitch-agent
     -            - neutron-lbaasv2-agent
  4. If you are happy with these changes, commit them into the Git repository as follows:

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "SOC 9 CLM Upgrade input model migration"
  5. Now you are ready to run the config-processor-run.yml playbook. If the necessary input model changes have been made, it will complete successfully:

    ardana > cd ~/openstack/ardana/ansible
     ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
     Enter encryption key (press return for none):
     confirm Enter encryption key (press return for none):
     To change encryption key enter new key (press return for none):
     confirm To change encryption key enter new key (press return for none):
    
     PLAY [localhost] **************************************************************
    
     GATHERING FACTS ***************************************************************
     ok: [localhost]
    
     ...
     PLAY RECAP ********************************************************************
     localhost                  : ok=24   changed=20   unreachable=0    failed=0
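Steps 1 and 2 above name eight service components to remove from the input model. The following is a minimal sketch that reports where they still appear as uncommented YAML list entries, assuming they are written one per line as "- <name>", as in the git diff output above. The function and variable names are hypothetical.

```shell
# Component names taken from the two lists in steps 1 and 2 above.
OBSOLETE_COMPONENTS="freezer-agent freezer-api heat-api-cloudwatch nova-console-auth
ceilometer-api glance-registry manila-share neutron-lbaasv2-agent"

# Hypothetical helper: print "file:line:text" for each uncommented list entry
# naming one of the components above, under the given directory
# (default: the definition path used in this guide).
find_obsolete_components() {
    dir="${1:-$HOME/openstack/my_cloud/definition}"
    for comp in $OBSOLETE_COMPONENTS; do
        # Anchored on both sides so that, for example, manila-api is not
        # reported when searching for manila-share.
        grep -rnE "^[[:space:]]*-[[:space:]]+${comp}[[:space:]]*$" "$dir"
    done
    return 0
}

# Example use on the deployer node:
#   find_obsolete_components
```

An empty result suggests that the removals are complete. Lines that were commented out rather than deleted are not reported, because they no longer start with a list marker.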

15.4.3 Upgrading Cloud Services

The deployer node is now ready to be used to upgrade the remaining cloud nodes and running services.

Warning

If you are upgrading from Helion OpenStack 8, a manual file update must be applied before continuing the upgrade process. In the file /usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml, replace `command` with `shell` in the first Ansible task. The correct version of the file appears below.

- name: deployer-setup | check-product-status | Check HOS product installed
  shell: |-
    zypper info hpe-helion-openstack-release | grep "^Installed *: *Yes"
  ignore_errors: yes
  register: product_flavor_hos

- name: deployer-setup | check-product-status | Check SOC product availability
  become: yes
  zypper:
    name: "suse-openstack-cloud-release>=8"
    state: present
  ignore_errors: yes
  register: product_flavor_soc

- name: deployer-setup | check-product-status | Provide help
  fail:
    msg: >
      The deployer node does not have a Cloud Add-On product installed.
      In YaST select Software/Add-On Products to see an overview of installed
       add-on products and use "Add" to add the Cloud product.
  when:
    - product_flavor_soc|failed
    - product_flavor_hos|failed

Changes to the check-product-status.yml file must be staged and committed via git.

git add -u
git commit -m "applying osconfig fix prior to HOS8 to SOC9 upgrade"
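To confirm that the manual edit was applied, you can check which module the first task in the file uses. The following is a minimal sketch, assuming the task layout shown above, where the module key is indented under the task name. The function name first_task_module is hypothetical.

```shell
# Hypothetical helper: print the module key (command or shell) used by the
# first task in the given tasks file that uses either of the two modules.
first_task_module() {
    awk '
        /^[ \t]+(command|shell):/ {
            sub(/^[ \t]+/, "")   # drop the indentation
            sub(/:.*/, "")       # keep only the key name
            print
            exit
        }
    ' "$1"
}

# Example use:
#   first_task_module /usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml
```

Once the edit described above has been applied, the command should print shell.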
Note

The ardana-upgrade.yml playbook runs the upgrade process against all nodes in parallel, though some of the steps are serialised to run on only one node at a time to avoid triggering potentially problematic race conditions. As such, the playbook can take a long time to run.

Procedure 15.10: Generate the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Based Scratch Area
  1. Generate the updated scratch area using the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    
    PLAY [localhost] **************************************************************
    
    GATHERING FACTS ***************************************************************
    ok: [localhost]
    
    ...
    
    PLAY RECAP ********************************************************************
    localhost                  : ok=31   changed=16   unreachable=0    failed=0
  2. Confirm that there are no pending updates for the deployer node. This could happen if you are using an SMT to manage the repositories, and updates have been released through the official channels since the deployer node was migrated. To check for any pending Cloud Lifecycle Manager package updates, you can run the ardana-update-pkgs.yml playbook as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit OPS-LM--first-member
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-dplyr-m1]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-update-pkgs.yml"
    }
    
    ...
    
    TASK: [_ardana-update-status | Report update status] **************************
    ok: [ardana-cp-dplyr-m1] => {
        "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n====================================================================="
    }
    
    msg: =====================================================================
    Update status for node ardana-cp-dplyr-m1:
    =====================================================================
    No pending update actions on the ardana-cp-dplyr-m1 host
    were collected or reset during this update run or persisted during
    previous unsuccessful or incomplete update runs.
    
    =====================================================================
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-update-pkgs.yml"
    }
    
    msg: Playbook finished - ardana-update-pkgs.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-dplyr-m1         : ok=98   changed=12   unreachable=0    failed=0
    localhost                  : ok=6    changed=2    unreachable=0    failed=0
    Note

    If running the ardana-update-pkgs.yml playbook identifies that there were updates that needed to be installed on your deployer node, then you need to go back to running the ardana-init command, followed by the cobbler-deploy.yml playbook, then the config-processor-run.yml playbook, and finally the ready-deployment.yml playbook, addressing any additional input model changes that may be needed. Then, repeat this step to check for any pending updates before continuing with the upgrade.

  3. Double-check that there are no pending actions needed for the deployer node by running the ardana-update-status.yml playbook, as follows:

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml --limit OPS-LM--first-member
    
    PLAY [resources] **************************************************************
    
    ...
    
    TASK: [_ardana-update-status | Report update status] **************************
    ok: [ardana-cp-dplyr-m1] => {
        "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n====================================================================="
    }
    
    msg: =====================================================================
    Update status for node ardana-cp-dplyr-m1:
    =====================================================================
    No pending update actions on the ardana-cp-dplyr-m1 host
    were collected or reset during this update run or persisted during
    previous unsuccessful or incomplete update runs.
    
    =====================================================================
    
    PLAY RECAP ********************************************************************
    ardana-cp-dplyr-m1         : ok=12   changed=0    unreachable=0    failed=0
  4. Having verified that there are no pending actions detected, it is safe to proceed with running the ardana-upgrade.yml playbook to upgrade the entire cloud:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-upgrade.yml
    PLAY [all] ********************************************************************
    
    ...
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-upgrade.yml"
    }
    
    msg: Playbook started - ardana-upgrade.yml
    
    ...
    ...
    ...
    ...
    ...
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-upgrade.yml"
    }
    
    msg: Playbook finished - ardana-upgrade.yml

The ardana-upgrade.yml playbook run will take a long time. The zypper dist-upgrade phase is serialised across all of the nodes and usually takes between 5 and 10 minutes for each node. This is followed by the cloud service upgrade phase, which will take approximately the same amount of time as a full cloud deploy. During this time, the cloud should remain basically functional, though there may be brief interruptions to some services. However, we recommend that you avoid any workload management tasks during this period.
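Because the run is long, it can help to capture the output to a file and check the PLAY RECAP afterwards. The following is a minimal sketch, not part of the product tooling: the function name recap_problem_hosts is hypothetical, and it assumes the recap lines have the format shown in the transcripts in this section.

```shell
# Print the hosts from an Ansible PLAY RECAP whose "failed" or
# "unreachable" count is non-zero. Reads captured playbook output on stdin.
# (Hypothetical helper; assumes the recap format shown in this guide.)
recap_problem_hosts() {
    awk '
        /PLAY RECAP/ { in_recap = 1; next }
        in_recap && /failed=[0-9]+/ {
            host = $1
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if ((kv[1] == "failed" || kv[1] == "unreachable") && kv[2] + 0 > 0) {
                    print host
                    break
                }
            }
        }
    '
}
```

For example, if the playbook output was saved with tee to a file of your choosing, run recap_problem_hosts with that file as standard input. No output means that no host reported a failure in the recap.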

Note

Until the ardana-upgrade.yml playbook run has completed successfully, other playbooks, such as ardana-status.yml, may report status problems. This is because some services that are expected to be running may not be installed, enabled, or migrated yet.

The ardana-upgrade.yml playbook run may sometimes fail during the whole cloud upgrade phase, if a service (for example, the monasca-thresh service) is slow to restart. In such cases, it is safe to run the ardana-upgrade.yml playbook again, and in most cases it should continue past the stage that failed previously. However, if the same problem persists across multiple runs, contact your support team for assistance.
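The retry advice above can be sketched as a small wrapper that reruns a command a bounded number of times. This is an illustration only: the function name retry and the choice of attempt limit are assumptions, not something the product provides.

```shell
# Run a command up to MAX_ATTEMPTS times, stopping at the first success.
# Returns 0 on success, otherwise the exit status of the last attempt.
# (Hypothetical helper for illustration.)
retry() {
    local max_attempts attempt status
    max_attempts=$1
    shift
    attempt=1
    status=0
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "Attempt ${attempt} of ${max_attempts}: $*" >&2
        "$@" && return 0
        status=$?
        attempt=$((attempt + 1))
    done
    echo "Giving up after ${max_attempts} attempts" >&2
    return "$status"
}

# Example (not executed here):
#   retry 3 ansible-playbook -i hosts/verb_hosts ardana-upgrade.yml
```

If the wrapper gives up, that corresponds to the case where the same problem persists across multiple runs, and contacting your support team is the next step.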

Important

It is important to disable all SUSE Linux Enterprise 12 SP3 SUSE OpenStack Cloud 8 Cloud Lifecycle Manager repositories before migrating the deployer to SUSE Linux Enterprise 12 SP4 SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. If you did not do this, then the first time you run the ardana-upgrade.yml playbook, it may complain that there are pending updates for the deployer node. This will require you to repeat the earlier steps to upgrade the deployer node, starting with running the ardana-init command. If this happens, repeat the steps as requested. Note that this does not represent a serious problem.

In SUSE OpenStack Cloud 9 Cloud Lifecycle Manager the LBaaS V2 legacy driver has been deprecated and removed. As part of the ardana-upgrade.yml playbook run, all existing LBaaS V2 load-balancers will be automatically migrated to being based on the Octavia Amphora provider. To enable creation of any new Octavia-based load-balancer instances, you need to ensure that an appropriate Amphora image is registered for use when creating instances, by following Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”.

Note

While running the ardana-upgrade.yml playbook, a point will be reached when the Neutron services are upgraded. As part of this upgrade, any existing LBaaS V2 load-balancer definitions will be migrated to Octavia Amphora-based load-balancer definitions.

After this migration of load-balancer definitions has completed, if a load-balancer failover is triggered, then the replacement load-balancer may fail to start, as an appropriate Octavia Amphora image for SUSE OpenStack Cloud 9 Cloud Lifecycle Manager will not yet be available.

However, once the Octavia Amphora image has been uploaded using the above instructions, then it will be possible to recover any failed load-balancers by re-triggering the failover: follow the instructions at https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#loadbalancer-failover.

15.4.4 Rebooting the Nodes into the SUSE Linux Enterprise 12 SP4 Kernel

At this point, all of the cloud services have been upgraded, but the nodes are still running the SUSE Linux Enterprise 12 SP3 kernel. The final step in the upgrade workflow is to reboot all of the nodes in the cloud in a controlled fashion, to ensure that active services fail over appropriately.

The recommended order for rebooting nodes is to start with the deployer. This requires special handling, since the Ansible-based automation cannot fully manage the reboot of the node that it is running on.

After that, we recommend rebooting the rest of the nodes in the control planes in a rolling-reboot fashion, ensuring that high-availability services remain available.

Finally, the compute nodes can be rebooted, either individually or in groups, as is appropriate to avoid interruptions to running workloads.

Warning

Do not reboot all your control plane nodes at the same time.

Procedure 15.11: Rebooting the Deployer Node

The reboot of the deployer node requires additional steps, as the Ansible-based automation framework cannot fully automate the reboot of the node that runs the ansible-playbook commands.

  1. Run the ardana-reboot.yml playbook limited to the deployer node, either by name, or using the logical node identifier OPS-LM--first-member, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit OPS-LM--first-member
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-dplyr-m1]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-reboot.yml"
    }
    
    msg: Playbook started - ardana-reboot.yml
    
    ...
    
    TASK: [ardana-reboot | Deployer node has to be rebooted manually] *************
    failed: [ardana-cp-dplyr-m1] => {"failed": true}
    msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook:
    cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1
    
    msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook:
    cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1
    
    FATAL: all hosts have already failed -- aborting
    
    PLAY RECAP ********************************************************************
               to retry, use: --limit @/var/lib/ardana/ardana-reboot.retry
    
    ardana-cp-dplyr-m1         : ok=8    changed=3    unreachable=0    failed=1
    localhost                  : ok=7    changed=0    unreachable=0    failed=0

    The ardana-reboot.yml playbook will fail when run on a deployer node; this is expected. The reported failure message tells you what you need to do to complete the remaining steps of the reboot manually: namely, rebooting the node, then logging back in again to run the _ardana-post-reboot.yml playbook, to start any services that need to be running on the node.

  2. Manually reboot the deployer node, for example with shutdown -r now.

  3. Once the deployer node has rebooted, you need to log in again and run the _ardana-post-reboot.yml playbook to complete the startup of any services that should be running on the deployer node, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit OPS-LM--first-member
    
    PLAY [resources] **************************************************************
    
    TASK: [Set pending_clm_update] ************************************************
    skipping: [ardana-cp-dplyr-m1]
    
    ...
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-status.yml"
    }
    
    msg: Playbook finished - ardana-status.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-dplyr-m1         : ok=26   changed=0    unreachable=0    failed=0
    localhost                  : ok=19   changed=1    unreachable=0    failed=0
Procedure 15.12: Rebooting the Remaining Control Plane Nodes

For the remaining nodes, you can use ardana-reboot.yml to fully automate the reboot process. However, it is recommended that you reboot the nodes in a rolling-reboot fashion, such that high-availability services continue to run without interruption. Similarly, to avoid interruption of service for any singleton services (such as the cinder-volume and cinder-backup services), they should be migrated off the intended node before it is rebooted, and then migrated back again afterwards.

  1. Use the ansible command's --list-hosts option to list the remaining nodes in the cloud that are neither the deployer nor a compute node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP'
        ardana-cp-dbmqsw-m1
        ardana-cp-dbmqsw-m2
        ardana-cp-dbmqsw-m3
        ardana-cp-osc-m1
        ardana-cp-osc-m2
        ardana-cp-mml-m1
        ardana-cp-mml-m2
        ardana-cp-mml-m3
  2. Use the following command to generate the set of ansible-playbook commands that need to be run to reboot all the nodes sequentially:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > for node in $(ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP'); do echo ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ${node} || break; done
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m1
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m2
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m3
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m1
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m2
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m1
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m2
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3
    Warning

    Do not reboot all your control-plane nodes at the same time.

  3. To reboot a specific control-plane node, you can use the above ansible-playbook commands as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-mml-m3]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-reboot.yml"
    }
    
    msg: Playbook started - ardana-reboot.yml
    
    
    
    ...
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-status.yml"
    }
    
    msg: Playbook finished - ardana-status.yml
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-reboot.yml"
    }
    
    msg: Playbook finished - ardana-reboot.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-mml-m3           : ok=389  changed=105  unreachable=0    failed=0
    localhost                  : ok=27   changed=1    unreachable=0    failed=0
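The rolling reboot in Procedure 15.12 can also be sketched as a loop that reboots one node at a time and stops at the first failure. This is a minimal sketch under stated assumptions: the function name rolling_reboot and the DRY_RUN variable are hypothetical, and it assumes that ansible --list-hosts prints one host name per line as shown in step 1. It does not migrate singleton services; that still has to be done before each node is rebooted.

```shell
# Reboot the hosts named on stdin one at a time, stopping at the first
# failure. Set DRY_RUN=1 to print the commands instead of running them.
# (Hypothetical helper; run it from ~/scratch/ansible/next/ardana/ansible.)
rolling_reboot() {
    local node
    while read -r node; do
        # "read" trims the indentation in front of each host name;
        # skip any blank lines.
        [ -n "$node" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit "$node"
        else
            # Redirect stdin so the playbook cannot consume the host list.
            ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit "$node" </dev/null || {
                echo "Reboot of ${node} failed; stopping" >&2
                return 1
            }
        fi
    done
}

# Example (not executed here):
#   ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP' | DRY_RUN=1 rolling_reboot
```

Running with DRY_RUN=1 first gives the same list of commands as step 2, which you can review before letting the loop run for real.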
Note

You can reboot more than one control-plane node at a time, but only if they are members of different control-plane clusters. For example, you could reboot one node out of each of the OpenStack controller, database, swift, monitoring or logging clusters, so long as doing so only reboots one node out of each cluster at the same time.

When rebooting the first member of the control-plane cluster where monitoring services run, the monasca-thresh service can sometimes fail to start up in a timely fashion when the node is coming back up after being rebooted. This can cause ardana-reboot.yml to fail. See below for suggestions on how to handle this problem.

Procedure 15.13: Getting monasca-thresh Running After an ardana-reboot.yml Failure

If the ardana-reboot.yml playbook failed because monasca-thresh didn't start up in a timely fashion after a reboot, you can retry starting the services on the node using the _ardana-post-reboot.yml playbook for the node. This is similar to the manual handling of the deployer reboot, since the node has already successfully rebooted onto the new kernel, and you just need to get the required services running again on the node.

It can sometimes take up to 15 minutes for the monasca-thresh service to successfully start in such cases.

  • However, if the service still fails to start after that time, then you may need to force a restart of the storm-nimbus and storm-supervisor services on all nodes in the MON-THR node group, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible MON-THR -b -m shell -a "systemctl restart storm-nimbus"
    ardana-cp-mml-m1 | success | rc=0 >>
    
    
    ardana-cp-mml-m2 | success | rc=0 >>
    
    
    ardana-cp-mml-m3 | success | rc=0 >>
    
    
    ardana > ansible MON-THR -b -m shell -a "systemctl restart storm-supervisor"
    ardana-cp-mml-m1 | success | rc=0 >>
    
    
    ardana-cp-mml-m2 | success | rc=0 >>
    
    
    ardana-cp-mml-m3 | success | rc=0 >>
    
    
    ardana > ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-mml-m1

If the monasca-thresh service still fails to start up, contact your support team.
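The wait of up to 15 minutes mentioned in this procedure can be sketched as a polling loop with a timeout. This is an illustration only: the function name wait_until is hypothetical, and the systemd unit name monasca-thresh used in the example comment is an assumption that you should check against your nodes.

```shell
# Poll a command until it succeeds or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# (Hypothetical helper for illustration.)
wait_until() {
    local timeout interval deadline
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while ! "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Example (not executed here; the unit name is an assumption):
#   wait_until 900 30 ansible MON-THR -b -m shell -a "systemctl is-active monasca-thresh"
```

A non-zero return after the timeout corresponds to the case where the service still fails to start and the storm services need to be restarted as described above.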

To check which control plane nodes have not yet been rebooted onto the new kernel, you can use an Ansible command to run the command uname -r on the target nodes, as follows:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible -i hosts/verb_hosts 'resources:!OPS-LM--first-member:!NOV-CMP' -m command -a 'uname -r'
ardana-cp-dbmqsw-m1 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-dbmqsw-m3 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-osc-m1 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-dbmqsw-m2 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-mml-m2 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-osc-m2 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-mml-m1 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-mml-m3 | success | rc=0 >>
4.12.14-95.57-default

ardana > uname -r
4.12.14-95.57-default

If any node's uname -r value does not match the kernel that the deployer is running, you probably have not yet rebooted that node.
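The comparison can be sketched as a filter over the Ansible output. This is a minimal sketch: the function name hosts_on_other_kernel is hypothetical, and it assumes the output format shown above, where each host header line is followed by the kernel release. Hosts that Ansible could not reach are not reported by this filter.

```shell
# Read the output of "ansible ... -m command -a 'uname -r'" on stdin and
# print the hosts whose kernel release differs from EXPECTED.
# (Hypothetical helper; assumes the output format shown in this guide.)
hosts_on_other_kernel() {
    local expected
    expected=$1
    awk -v expected="$expected" '
        # Header line: "HOST | success | rc=0 >>"
        / \| success \| / { host = $1; next }
        # First non-blank line after the header is the kernel release.
        host != "" && NF > 0 {
            if ($1 != expected) print host
            host = ""
        }
    '
}

# Example (not executed here):
#   ansible -i hosts/verb_hosts 'resources:!OPS-LM--first-member:!NOV-CMP' -m command -a 'uname -r' | hosts_on_other_kernel "$(uname -r)"
```

An empty result suggests that every reachable node is running the same kernel as the deployer.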

Procedure 15.14: Rebooting the Compute Nodes

Finally, you need to reboot the compute nodes. Rebooting multiple compute nodes at the same time is possible, so long as doing so does not compromise the integrity of running workloads. We recommend that you migrate workloads off groups of compute nodes in a controlled fashion, enabling them to be rebooted together.

Warning

Do not reboot all of your compute nodes at the same time.

  1. To see all the compute nodes that are available to be rebooted, you can run the following command:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible -i hosts/verb_hosts --list-hosts NOV-CMP
        ardana-cp-slcomp0001
        ardana-cp-slcomp0002
    ...
        ardana-cp-slcomp0080
  2. Reboot the compute nodes, individually or in groups, using the ardana-reboot.yml playbook as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-slcomp0001,ardana-cp-slcomp0002
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-slcomp0001]
    ok: [ardana-cp-slcomp0002]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-reboot.yml"
    }
    
    msg: Playbook started - ardana-reboot.yml
    
    ...
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-status.yml"
    }
    
    msg: Playbook finished - ardana-status.yml
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-reboot.yml"
    }
    
    msg: Playbook finished - ardana-reboot.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-slcomp0001       : ok=120  changed=11   unreachable=0    failed=0
    ardana-cp-slcomp0002       : ok=120  changed=11   unreachable=0    failed=0
    localhost                  : ok=27   changed=1    unreachable=0    failed=0
Important

You must ensure that there is sufficient unused workload capacity to host any migrated workload or Amphora instances that may be running on the targeted compute nodes.

When rebooting multiple compute nodes at the same time, consider manually migrating any running workloads and Amphora instances off the target nodes in advance, to avoid any potential risk of workload or service interruption.
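Rebooting compute nodes in groups can be sketched by splitting the host list into comma-separated batches that fit the --limit option. This is an illustration only: the function name batch_hosts is hypothetical, and the batch size is something you have to choose based on your spare capacity.

```shell
# Group the host names read from stdin into comma-separated batches of
# at most SIZE hosts, one batch per line, suitable for --limit.
# (Hypothetical helper for illustration.)
batch_hosts() {
    local size
    size=$1
    awk -v size="$size" '
        NF > 0 {
            batch = (count == 0) ? $1 : batch "," $1
            count++
            if (count == size) { print batch; count = 0 }
        }
        END { if (count > 0) print batch }
    '
}

# Example (not executed here): print one reboot command per batch of two.
#   ansible -i hosts/verb_hosts --list-hosts NOV-CMP | batch_hosts 2 |
#       while read -r batch; do
#           echo ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit "$batch"
#       done
```

Printing the commands rather than running them leaves room to migrate workloads and Amphora instances off each batch before it is rebooted.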

15.4.5 Post-Upgrade Tasks

If designate was configured before the upgrade to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, it is still using the deprecated service components designate-zone-manager and designate-pool-manager.

These components will continue to operate correctly under SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, but we recommend that you migrate to the newer designate-worker and designate-producer service components by following the procedure documented in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 25 “DNS Service Installation Overview”, Section 25.4 “Migrate Zone/Pool to Worker/Producer after Upgrade”.

Procedure 15.15: Cleanup Orphaned Packages
  • After migrating the deployer node, a small number of installed packages are no longer required, such as the ceilometer and freezer virtualenv (venv) packages.

    You can list the orphaned packages and safely remove these venv packages with the following commands:

    ardana > zypper packages --orphaned
    Loading repository data...
    Reading installed packages...
    S | Repository | Name                             | Version                           | Arch
    --+------------+----------------------------------+-----------------------------------+-------
    i | @System    | python-flup                      | 1.0.3.dev_20110405-2.10.52        | noarch
    i | @System    | python-happybase                 | 0.9-1.64                          | noarch
    i | @System    | venv-openstack-ceilometer-x86_64 | 9.0.8~dev7-12.24.2                | noarch
    i | @System    | venv-openstack-freezer-x86_64    | 5.0.0.0~xrc2~dev2-10.22.1         | noarch
    ardana > sudo zypper remove venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64
    Loading repository data...
    Reading installed packages...
    Resolving package dependencies...
    
    The following 2 packages are going to be REMOVED:
      venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64
    
    2 packages to remove.
    After the operation, 79.0 MiB will be freed.
    Continue? [y/n/...? shows all options] (y): y
    (1/2) Removing venv-openstack-ceilometer-x86_64-9.0.8~dev7-12.24.2.noarch ..................................................................[done]
    Additional rpm output:
    /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      return yaml.load(f)
    
    
    (2/2) Removing venv-openstack-freezer-x86_64-5.0.0.0~xrc2~dev2-10.22.1.noarch ..............................................................[done]
    Additional rpm output:
    /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      return yaml.load(f)
Procedure 15.16: Delete freezer Containers from swift

The freezer service has been deprecated and removed from SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, but the backups that the freezer service created before you upgraded will still be consuming space in your swift object store.

Therefore, once you have completed the upgrade successfully, you can safely delete the containers that freezer used to hold the database and ring backups, freeing up that space.

  • Using the credentials in the backup.osrc file, found on the deployer node in the Ardana account's home directory, run the following commands:

    ardana > . ~/backup.osrc
    ardana > swift list
    freezer_database_backups
    freezer_rings_backups
    ardana > swift delete --all
    freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/1_1598548599/segments/000000021
    freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/2_1598605266/data1
    ...
    freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/0_1598505404/segments/000000001
    freezer_database_backups
    freezer_rings_backups/metadata/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/metadata
    ...
    freezer_rings_backups/data/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/data
    freezer_rings_backups
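Note that swift delete --all removes every container that the account can see, which is fine when only the freezer containers are listed. If other containers are present, a more selective approach is to delete only containers whose names start with freezer_. The following is a minimal sketch: the function name freezer_containers is hypothetical, and the name prefix is taken from the listing above.

```shell
# Keep only the container names that start with "freezer_" from the
# "swift list" output read on stdin. Prints nothing if none match.
# (Hypothetical helper for illustration.)
freezer_containers() {
    grep '^freezer_' || true
}

# Example (not executed here): delete each matching container in turn.
# Each container is passed to its own "swift delete" call, because any
# further positional arguments are treated as object names.
#   swift list | freezer_containers | while read -r container; do
#       swift delete "$container"
#   done
```

Reviewing the output of swift list | freezer_containers before deleting gives a chance to confirm that only the expected containers are selected.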

15.5 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment

Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.

Use the following steps to deploy a PTF:

  1. When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:

    ardana > tmpdir=`mktemp -d`
    ardana > cd $tmpdir
    ardana > wget --no-directories --recursive --reject "index.html*" \
    --user=USER_NAME \
    --ask-password \
    --no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030/
  2. Remove any old data from the PTF repository, such as a repository listing left over from a migration or from previously installed product patches:

    ardana > sudo rm -rf /srv/www/suse-12.4/x86_64/repos/PTF/*
  3. Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a neutron PTF.

    ardana > sudo mkdir -p /srv/www/suse-12.4/x86_64/repos/PTF/
    ardana > sudo mv $tmpdir/* /srv/www/suse-12.4/x86_64/repos/PTF/
    ardana > sudo chown --recursive root:root /srv/www/suse-12.4/x86_64/repos/PTF/*
    ardana > rmdir $tmpdir
  4. Create or update the repository metadata:

    ardana > sudo /usr/local/sbin/createrepo-cloud-ptf
    Spawning worker 0 with 2 pkgs
    Workers Finished
    Saving Primary metadata
    Saving file lists metadata
    Saving other metadata
  5. Refresh the PTF repository before installing package updates on the Cloud Lifecycle Manager:

    ardana > sudo zypper refresh --force --repo PTF
    Forcing raw metadata refresh
    Retrieving repository 'PTF' metadata ..........................................[done]
    Forcing building of repository cache
    Building repository 'PTF' cache ..........................................[done]
    Specified repositories have been refreshed.
  6. Verify that the PTF packages show as available on the deployer:

    ardana > sudo zypper se --repo PTF
    Loading repository data...
    Reading installed packages...
    
    S | Name                          | Summary                                  | Type
    --+-------------------------------+------------------------------------------+--------
      | python-neutronclient          | Python API and CLI for OpenStack neutron | package
    i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack neutron  | package
  7. Install the PTF venv packages on the Cloud Lifecycle Manager:

    ardana > sudo zypper dup --from PTF
    Refreshing service
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    The following package is going to be upgraded:
      venv-openstack-neutron-x86_64
    
    The following package has no support information from its vendor:
      venv-openstack-neutron-x86_64
    
    1 package to upgrade.
    Overall download size: 64.2 MiB. Already cached: 0 B. After the operation, additional 6.9 KiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ... (1/1),  64.2 MiB ( 64.6 MiB unpacked)
    Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done]
    Checking for file conflicts: ..............................................................[done]
    (1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done]
    Additional rpm output:
    warning
    warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEY
  8. Validate that the venv tarball has been installed into the deployment directory. (Note: the packages file in that directory lists the registered tarballs that will be used for the services, which should align with the installed venv RPM.)

    ardana > ls -la /opt/ardana_packager/ardana-9/sles_venv/x86_64
    total 898952
    drwxr-xr-x 2 root root     4096 Oct 30 16:10 .
    ...
    -rw-r--r-- 1 root root 67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz <<<
    -rw-r--r-- 1 root root 64674087 Aug 14 16:14 nova-20180814T161306Z.tgz
    -rw-r--r-- 1 root root 45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz
    -rw-r--r-- 1 root root     1879 Oct 30 16:10 packages
    -rw-r--r-- 1 root root 27186008 Apr 26  2018 swift-20180426T230541Z.tgz
  9. Install the non-venv PTF packages on the Compute Node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' --limit comp0001-mgmt

    When it has finished, you can see that the upgraded package has been installed on comp0001-mgmt.

    ardana > sudo zypper se --detail python-neutronclient
    Loading repository data...
    Reading installed packages...
    
    S | Name                 | Type     | Version                         | Arch   | Repository
    --+----------------------+----------+---------------------------------+--------+--------------------------------------
    i | python-neutronclient | package  | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF
      | python-neutronclient | package  | 6.5.0-4.361                     | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1
  10. Running the ardana update playbook will distribute the PTF venv packages to the cloud server. Then you can find them loaded in the virtual environment directory with the other venvs.

    The Compute Node before running the update playbook:

    ardana > ls -la /opt/stack/venv
    total 24
    drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
    drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
    drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
    drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
  11. Run the update.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt

    When it has finished, you can see that an additional virtual environment has been installed.

    ardana > ls -la /opt/stack/venv
    total 28
    drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
    drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
    drwxr-xr-x  9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z <<< New venv installed
    drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
    drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
  12. The PTF may also have RPM package updates in addition to venv updates. To complete the update, follow the instructions at Section 15.3.1, “Performing the Update”.
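To confirm which venv a node ended up with after step 11, the directory names can be compared by their embedded timestamps. The following is a minimal sketch: the function name latest_venv is hypothetical, and it assumes the SERVICE-YYYYMMDDTHHMMSSZ naming shown in the listings above, where a plain lexical sort matches chronological order.

```shell
# From a list of venv directory names on stdin, print the newest one for
# SERVICE, based on the timestamp embedded in the name.
# (Hypothetical helper; assumes names like neutron-20181030T124310Z.)
latest_venv() {
    local service
    service=$1
    grep "^${service}-[0-9]\{8\}T[0-9]\{6\}Z\$" | LC_ALL=C sort | tail -n 1
}

# Example (not executed here):
#   ls /opt/stack/venv | latest_venv neutron
```

The timestamp in the result should match the one in the tarball name seen in step 8, for example neutron-20181030T124310Z for neutron-20181030T124310Z.tgz.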

15.6 Periodic OpenStack Maintenance Tasks

The heat-manage command helps manage heat-specific database operations. The associated database should be purged periodically to save space. Set this up as a cron job on the servers where the heat service is running, at /etc/cron.weekly/local-cleanup-heat, with the following content:

  #!/bin/bash
  su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :

The nova-manage db archive_deleted_rows command moves deleted rows from production tables to shadow tables. Including --until-complete makes the command run continuously until all deleted rows are archived. We recommend setting up this task as /etc/cron.weekly/local-cleanup-nova on the servers where the nova service is running, with the following content:

  #!/bin/bash
  su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :
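Scripts in /etc/cron.weekly have to be executable to be run. Creating the two files can be sketched as follows. This is an illustration only: the function name install_cleanup_job is hypothetical, and the target directory is passed as a parameter so that the result can be checked in a scratch directory first.

```shell
# Write a cleanup script NAME into directory DIR that runs COMMAND as
# USER, and make it executable so that cron will run it.
# (Hypothetical helper for illustration; needs write access to DIR.)
install_cleanup_job() {
    local dir name user command
    dir=$1
    name=$2
    user=$3
    command=$4
    cat > "${dir}/${name}" <<EOF
#!/bin/bash
su ${user} -s /bin/bash -c "${command}" || :
EOF
    chmod 0755 "${dir}/${name}"
}

# Example (not executed here; run as root on the relevant servers):
#   install_cleanup_job /etc/cron.weekly local-cleanup-heat heat "/usr/bin/heat-manage purge_deleted -g days 14"
#   install_cleanup_job /etc/cron.weekly local-cleanup-nova nova "/usr/bin/nova-manage db archive_deleted_rows --until-complete"
```

The generated files have the same content as the two scripts shown above.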

16 Operations Console

The Operations Console, often referred to as the Ops Console, is a web-based graphical user interface (GUI) that you can use to view data about your cloud infrastructure and ensure your cloud is operating correctly.

16.1 Using the Operations Console

16.1.1 Operations Console Overview

You can use the Operations Console for SUSE OpenStack Cloud 9 to view data about your SUSE OpenStack Cloud infrastructure in a web-based graphical user interface (GUI) and ensure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can manage data in the following ways:

Triage alarm notifications.

  • Alarm Definitions and notifications now have their own screens and are collected under the Alarm Explorer menu item which can be accessed from the Central Dashboard. Central Dashboard now allows you to customize the view in the following ways:

    • Rename or re-configure existing alarm cards to include services different from the defaults

    • Create a new alarm card with the services you want to select

    • Reorder alarm cards using drag and drop

    • View all alarms that have no service dimension, now grouped in an Uncategorized Alarms card

    • View all alarms that have a service dimension that does not match any of the other cards, now grouped in an Other Alarms card

  • You can also easily access alarm data for a specific component. On the Summary page for the following components, a link is provided to an alarms screen specifically for that component:

16.1.1.1 Monitor the environment by giving priority to alarms that take precedence.

Alarm Explorer now allows you to manage alarms in the following ways:

  • Refine the monitoring environment by creating new alarms to specify a combination of metrics, services, and hosts that match the triggers unique to an environment

  • Filter alarms in one place using an enumerated filter box instead of service badges

  • Specify full alarm IDs as dimension key-value pairs in the form of dimension=value

16.1.1.2 Support Changes

  • To resolve scalability issues, plain text search through alarm sets is no longer supported

The Business Logic Layer of Operations Console is a middleware component that serves as a single point of contact for the user interface to communicate with OpenStack services such as monasca, nova, and others.

16.1.2 Connecting to the Operations Console

Instructions for accessing the Operations Console through a web browser.

To connect to Operations Console, perform the following:

Operations Console will always be accessed over port 9095.

16.1.2.1 Required Access Credentials

In previous versions of Operations Console you were required to have only the password for the Administrator account (admin by default). Now the Administrator user account must also have all of the following credentials:

Project          Domain           Role                   Description
*All projects*   *not specific*   Admin                  Admin role on at least one project
*All projects*   *not specific*   Admin                  Admin role in default domain
Admin            default          Admin or monasca-user  Admin or monasca-user role on admin project
Important

If your login account has administrator role on the administrator project, then you only need to make sure you have the administrator role on the default domain.

Administrator account

During installation, an administrator account called admin is created by default.

Administrator password

During installation, an administrator password is randomly created by default. It is not recommended that you change the default password.

To find the randomized password:

  1. To display the password, log on to the Cloud Lifecycle Manager and run:

    cat ~/service.osrc

16.1.2.2 Connect Through a Browser

The following instructions show you how to find the URL to access Operations Console. You will use SSH (Secure Shell), which provides administrators with a secure way to access a remote computer.

To access Operations Console:

  1. Log in to the Cloud Lifecycle Manager.

  2. Locate the URL or IP address for the Operations Console with the following command:

    source ~/service.osrc && openstack endpoint list | grep opsconsole | grep admin

    Sample output:

    | 8ef10dd9c00e4abdb18b5b22adc93e87 | region1 | opsconsole | opsconsole | True | admin | https://192.168.24.169:9095/api/v1/

    To access Operations Console, take the URL from the sample output, remove everything after port 9095 (api/v1/), and enter the result in a browser:

    https://192.168.24.169:9095
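
The trimming step above can also be done in the shell. This is a minimal sketch, assuming the endpoint URL has already been copied from the command output; the function name ops_console_url is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reduce an endpoint URL to scheme://host:port by
# dropping the path component (for example /api/v1/).
ops_console_url() {
    printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+).*$#\1#'
}

# Using the URL from the sample output above:
ops_console_url "https://192.168.24.169:9095/api/v1/"
# prints https://192.168.24.169:9095
```

The pattern keeps everything up to the first slash after the host and port, so it does not depend on the exact API path.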

16.1.2.3 Optionally use a Hostname OR virtual IP address to access Operations Console

Important

If you can access Operations Console using the above instructions, then you can skip this section. These steps provide an alternate way to access Operations Console if the above steps do not work for you.

To find your hostname OR IP address:

  1. Navigate to and open in a text editor the following file:

    network_groups.yml
  2. Find the following entry:

    external-name
  3. If your administrator set a hostname value in the external-name field, you will use that hostname when logging in to Operations Console. For example, in a browser you would type:

    https://VIP:9095
  4. If your administrator did not set a hostname value, then to determine the IP address to use, from your Cloud Lifecycle Manager, run:

    grep HZN-WEB /etc/hosts

    The output of that command will show you the virtual IP address you should use. For example, in a browser you would type:

    https://VIP:9095
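
As a sketch of the last step, the address can be turned into a browser URL directly. This assumes the matching line in /etc/hosts has the usual layout with the IP address in the first column; the function name ops_console_vip_url is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: find the first non-comment line mentioning HZN-WEB in
# a hosts-format file and print an Operations Console URL built from the
# address in its first column.
ops_console_vip_url() {
    local hosts_file="${1:-/etc/hosts}"
    awk '/HZN-WEB/ && $1 !~ /^#/ { print "https://" $1 ":9095"; exit }' "$hosts_file"
}
```

Called without arguments on the Cloud Lifecycle Manager, the function reads /etc/hosts. If no line matches, it prints nothing.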

16.1.3 Managing Compute Hosts

Operations Console (Ops Console) provides a graphical interface for you to add and delete compute hosts.

As your deployment grows and changes, you may need to add more compute hosts to increase your capacity for VMs, or delete a host to reallocate hardware for a different use. To accomplish these tasks, in previous versions of SUSE OpenStack Cloud you had to use the command line to update configuration files and run ansible playbooks. Now Operations Console provides a graphical interface for you to complete the same tasks quickly using menu items in the console.

Important

Do not refresh the Operations Console page or open Operations Console in another window during the following tasks. If you do, you will not see any notifications or be able to review the error log for more information. This would make troubleshooting difficult since you would not know the error that was encountered, or why it occurred.

Use Operations Console to perform the following tasks:

Important

To use Operations Console, you need to have the correct permissions and know the URL or VIP connected to Operations Console during installation.

16.1.3.1 Create a Compute Host

If you need to create additional compute hosts for more virtual machine capacity, you can do this easily on the Compute Hosts screen.

To add a compute host:

  1. To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left side, click Compute, and then Compute Hosts.

  4. On the Compute Hosts page, click Create Host.

  5. On the Add & Activate Compute Host tab that slides in from the right, enter the following information:

    Host ID

    Cloud Lifecycle Manager model's server ID

    Host Role

    Defined in the Cloud Lifecycle Manager model and cannot be modified in Operations Console

    Host Group

    Defined in the Cloud Lifecycle Manager model and cannot be modified in Operations Console

    Host NIC Mapping

    Defined in the Cloud Lifecycle Manager model and cannot be modified in Operations Console

    Encryption Key

    If the configuration is encrypted, enter the encryption key here

  6. Click Create Host, and in the confirmation screen that opens, click Confirm.

  7. Wait for SUSE OpenStack Cloud to complete the pre-deployment steps. This process can take up to 2 minutes.

  8. If pre-deployment is successful, you will see a notification that deployment has started.

    Important

    If you receive a notice that pre-deployment did not complete successfully, read the notification explaining at which step the error occurred. You can click on the error notification and see the ansible log for the configuration processor playbook. Then you can click Create Host in step 4 again and correct the mistake.

  9. Wait for SUSE OpenStack Cloud to complete the deployment steps. This process can take up to 20 minutes.

  10. If deployment is successful, you will see a notification and a new entry will appear in the compute hosts table.

    Important

    If you receive a notice that deployment did not complete successfully, read the notification explaining at which step the error occurred. You can click on the error notification for more details.

16.1.3.2 Deactivate a Compute Host

If you have multiple compute hosts and for debugging reasons you want to disable them all except one, you may need to deactivate and then activate a compute host. If you want to delete a host, you will also have to deactivate it first. This can be done easily in the Operations Console.

Important

The host must be in the following state: ACTIVATED

To deactivate a compute host:

  1. To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left side, click Compute, and then Compute Hosts.

  4. On the Compute Hosts page, in the row for the host you want to deactivate, click the details button (Ellipsis Icon).

  5. Click Deactivate, and in the confirmation screen that opens, click Confirm.

  6. Wait for SUSE OpenStack Cloud to complete the operation. This process can take up to 2 minutes.

  7. If deactivation is successful, you will see a notification and in the compute hosts table the STATE will change to DEACTIVATED.

    Important

    If you receive a notice that the operation did not complete successfully, read the notification explaining at which step the error occurred. You can click on the link in the error notification for more details. In the compute hosts table the STATE will remain ACTIVATED.

16.1.3.3 Activate a Compute Host

Important

The host must be in the following state: DEACTIVATED

To activate a compute host:

  1. To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left side, click Compute, and then Compute Hosts.

  4. On the Compute Hosts page, in the row for the host you want to activate, click the details button (Ellipsis Icon).

  5. Click Activate, and in the confirmation screen that opens, click Confirm.

  6. Wait for SUSE OpenStack Cloud to complete the operation. This process can take up to 2 minutes.

  7. If activation is successful, you will see a notification and in the compute hosts table the STATE will change to ACTIVATED.

    Important

    If you receive a notice that the operation did not complete successfully, read the notification explaining at which step the error occurred. You can click on the link in the error notification for more details. In the compute hosts table the STATE will remain DEACTIVATED.

16.1.3.4 Delete a Compute Host

If you need to scale down the size of your current deployment to use the hardware for other purposes, you may want to delete a compute host.

Important

Complete the following steps before deleting a host:

  • The host must be in the following state: DEACTIVATED

  • Optionally, you can migrate the instances off the host to be deleted. To do this, complete the following steps in Section 15.1.3.5, “Removing a Compute Node”:

    1. Disable provisioning on the compute host.

    2. Use live migration to move any instances on this host to other hosts.

To delete a compute host:

  1. To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left side, click Compute, and then Compute Hosts.

  4. On the Compute Hosts page, in the row for the host you want to delete, click the details button (Ellipsis Icon).

  5. Click Delete, and if the configuration is encrypted, enter the encryption key.

  6. In the confirmation screen that opens, click Confirm.

  7. In the compute hosts table you will see the STATE change to Deleting.

  8. Wait for SUSE OpenStack Cloud to complete the operation. This process can take up to 2 minutes.

  9. If deletion is successful, you will see a notification and in the compute hosts table the host will not be listed.

    Important

    If you receive a notice that the operation did not complete successfully, read the notification explaining at which step the error occurred. You can click on the link in the error notification for more details. In the compute hosts table the STATE will remain DEACTIVATED.

16.1.3.5 For More Information

For more information on how to complete these tasks through the command line, see the following topics:

16.1.4 Managing Swift Performance

In Operations Console you can monitor your swift cluster to ensure long-term data protection as well as sufficient performance.

OpenStack swift is an object storage solution with a focus on availability. While there are various mechanisms inside swift to protect stored data and ensure a high availability, you must still closely monitor your swift cluster to ensure long-term data protection as well as sufficient performance. The best way to manage swift is to collect useful data that will detect possible performance impacts early on.

The new Object Summary Dashboard in Operations Console provides an overview of your swift environment.

Important

If swift is not installed and configured, you will not be able to access this dashboard. The swift endpoint must be present in keystone for the Object Summary to be present in the menu.

In Operations Console's object storage dashboard, you can easily review the following information:

16.1.4.1 Performance Summary

View a comprehensive summary of current performance values.

To access the object storage performance dashboard:

  1. To open Operations Console, in a browser enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. In the menu, click Storage › Object Storage Summary.

Performance data includes:

Healthcheck Latency from monasca

This latency is the average time it takes for swift to respond to a healthcheck, or ping, request. The swiftlm-uptime monitor program reports the value. A large difference between average and maximum may indicate a problem with one node.

Operational Latency from monasca

Operational latency is the average time it takes for swift to respond to an upload, download, or object delete request. The swiftlm-uptime monitor program reports the value. A large difference between average and maximum may indicate a problem with one node.

Service Availability

This is the availability over the last 24 hours as a percentage.

  • 100% - No outages in the last 24 hours

  • 50% - swift was unavailable for a total of 12 hours in the last 24-hour period
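
The two examples above follow from a simple ratio: the hours swift was available, divided by the 24-hour window. A minimal sketch of that arithmetic, with a hypothetical function name and the outage time given in hours:

```shell
#!/bin/bash
# Hypothetical helper: availability over a 24-hour window as a percentage,
# given the total outage time in hours (fractions are allowed).
availability_percent() {
    awk -v down="$1" 'BEGIN { printf "%.1f\n", (24 - down) / 24 * 100 }'
}

availability_percent 0    # prints 100.0
availability_percent 12   # prints 50.0
```

This only illustrates how the percentage relates to outage time; it is not how Operations Console itself computes the value.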

Graph of Performance Over Time

Create a visual representation of performance data to see when swift encountered longer-than-normal response times.

To create a graph:

  1. Choose the length of time you want to graph in Date Range. This sets the length of time for the x-axis which counts backwards until it reaches the present time. In the example below, 1 day is selected, and so the x axis shows performance starting from 24 hours ago (-24) until the present time.

  2. Look at the y-axis to understand the range of response times. The first number is the smallest value in the data collected from the backend, and the last number is the longest amount of time it took swift to respond to a request. In the example below, the shortest time for a response from swift was 16.1 milliseconds.

  3. Look for spikes which represent longer than normal response times. In the example below, swift experienced long response times 21 hours ago and again 1 hour ago.

  4. Look for the latency value at the present time. The line running across the x-axis at 16.1 milliseconds shows you what the response time is currently.

Image

16.1.4.2 Inventory Summary

Monitor details about all the swift resources deployed in your cloud.

To access the object storage inventory screen:

  1. To open Operations Console, in a browser enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. In the menu, click Storage › Object Storage Summary.

  4. On the Summary page, click Inventory Summary.

Image

General swift metrics are available for the following attributes:

  • Time to replicate: The average time in seconds it takes all hosts to complete a replication cycle.

  • Oldest replication: The time in seconds that has elapsed since the object replication process completed its last replication cycle.

  • Async Pending: This is the number of failed requests to add an entry in the container server's database. There is one async queue per swift disk, and a cron job queries all swift servers to calculate the total. When an object is uploaded into swift, and it is successfully stored, a request is sent to the container server to add a new entry for the object in the database. If the container update fails, the request is stored in what swift calls an Async Pending Queue.

    Important

    On a public cloud deployment, this value can reach millions. If it continues to grow, it means that the container updates are not keeping up with the requests. It is also normal for this number to grow if a node hosting the swift container service is down.

  • Total number of alarms: This number includes all nodes that host swift services, including proxy, account, container, and object storage services.

  • Total nodes: This number includes all nodes that host swift services, including proxy, account, container, and object storage services. The number in the colored box represents the number of alarms in that state. The following colors are used to show the most severe alarm triggered on all nodes:

    Green

    Indicates all alarms are in a known and untriggered state. For example, if there are 5 nodes and they are all known with no alarms, you will see the number 5 in the green box, and a zero in all the other colored boxes.

    Yellow

    Indicates that some low or medium alarms have been triggered but no critical or high alarms. For example, if there are 5 nodes, and there are 3 nodes with untriggered alarms and 2 nodes with medium severity alarms, you will see the number 3 in the green box, the number 2 in the yellow box, and zeros in all the other colored boxes.

    Red

    Indicates at least one critical or high severity alarm has been triggered on a node. For example, if there are 5 nodes, and there are 3 nodes with untriggered alarms, 1 node with a low severity, and 1 node with a critical alarm, you will see the number 3 in the green box, the number 1 in the yellow box, the number 1 in the red box, and a zero in the gray box.

    Gray

    Indicates that all alarms on the nodes are unknown. For example, if there are 5 nodes with no data reported, you will see the number 5 in the gray box, and zeros in all the other colored boxes.

  • Cluster breakdown of nodes: In the example screen above, the cluster consists of 2 nodes named SWPAC and SWOBJ. Click a node name to bring up more detailed information about that node.
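
The color rules for Total nodes can be summarized as "the most severe alarm on a node decides its box". The following sketch illustrates that rule only. The function name and the state labels (ok, low, medium, high, critical, unknown) are hypothetical. The description above does not say how a node with a mix of untriggered and unknown alarms is counted, so treating that case as gray is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: given the states of all alarms on one node, print the
# color of the box that node is counted in.
node_box_color() {
    local worst="green" state seen_unknown=0
    for state in "$@"; do
        case "$state" in
            critical|high) echo "red"; return 0 ;;   # most severe, decided
            medium|low)    worst="yellow" ;;
            unknown)       seen_unknown=1 ;;
            ok)            ;;
            *)             echo "unrecognized alarm state: $state" >&2; return 1 ;;
        esac
    done
    # Assumption: with nothing triggered, any unknown alarm makes the node gray.
    if [ "$worst" = "green" ] && [ "$seen_unknown" -eq 1 ]; then
        worst="gray"
    fi
    echo "$worst"
}
```

For the Red example above, the three untriggered nodes map to green, the node with a low severity alarm to yellow, and the node with a critical alarm to red.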

16.1.4.3 Capacity Summary

Use this screen to view the size of the file system space on all nodes and disk drives assigned to swift. Also shown is the remaining space available and the total size of all file systems used by swift. Values are given in megabytes (MB).

To access the object storage capacity summary screen:

  1. To open Operations Console, in a browser enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. In the menu, click Storage › Object Storage Summary.

  4. On the Summary page, click Capacity Summary.

Image

16.1.4.4 Alarm Summary

Use this page to quickly see the most recent alarms and triage all alarms related to object storage.

To access the object storage alarm summary screen:

  1. To open Operations Console, in a browser enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. In the menu, click Storage › Object Storage Summary.

  4. On the Summary page, click Alarm Summary.

Image

Each row has a checkbox to allow you to select multiple alarms and set the same condition on them.

The State column displays a graphical indicator representing the state of each alarm:

  • Green indicator: OK. Good operating state.

  • Yellow indicator: Warning. Low severity, not requiring immediate action.

  • Red indicator: Alarm. Varying severity levels and must be addressed.

  • Gray indicator: Undetermined.

The Alarm column identifies the alarm by the name it was given when it was originally created.

The Last Check column displays the date and time of the most recent occurrence of the alarm.

The Dimension column describes the components to check in order to clear the alarm.

The last column, depicted by three dots, reveals an Actions menu that allows you to choose:

  • View Details, which opens a separate window that shows all the information from the table view and the alarm history.

    Comments can be updated by clicking Update Comment. Click View Alarm Definition to go to the Alarm Definition tab showing that specific alarm definition.

16.1.5 Visualizing Data in Charts

Operations Console allows you to create a new chart and select the time range and the metric you want to chart, based on monasca metrics.

Present data in a pictorial or graphical format to enable administrators and decision makers to grasp difficult concepts or identify new patterns.

Create new time-series graphs from My Dashboard.

My Dashboard also allows you to customize the view in the following ways:

  • Include alarm cards from the Central Dashboard

  • Customize graphs in new ways

  • Reorder items using drag and drop

Plan for future storage

  • Track capacity over time to predict with some degree of reliability the amount of additional storage needed.

Charts and graphs provide a quick way to visualize large amounts of complex data. They are especially useful when trying to find relationships and understand your data, which could include thousands or even millions of variables. You can create a new chart in Operations Console from My Dashboard.

The charts in Operations Console are based on monasca data. When you create a new chart you will be able to select the time range and the metric you want to chart. The list of Metrics you can choose from is equivalent to using the monasca metric-name-list on the command line. After you select a metric, you can then specify a dimension, which is derived from the monasca metric-list --name <metric_name> command line results. The dimension list changes based on the selected metric.
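
To preview on the command line what the Metric and Dimension lists will offer, the two commands mentioned above can be run one after the other. This is a minimal sketch that assumes the monasca client is installed and credentials have already been loaded (for example with source ~/service.osrc). The wrapper name show_chart_choices is hypothetical, and cpu.idle_perc in the usage note is only an example metric name.

```shell
#!/bin/bash
# Hypothetical wrapper: list all metric names, then list the metrics (with
# their dimensions) recorded under one metric name.
show_chart_choices() {
    local metric="$1"
    monasca metric-name-list
    monasca metric-list --name "$metric"
}
```

For example, show_chart_choices cpu.idle_perc would print the full metric name list followed by the dimension sets for that metric.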

This topic provides instructions on how to create a basic chart, and how to create a chart specifically to visualize your cinder capacity.

16.1.5.1 Create a Chart

Create a chart to visually display data for up to 6 metrics over a period of time.

To create a chart:

  1. To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.

    For example:

    https://myardana.test:9095
    https://VIP:9095
  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left, select Home, and then select My Dashboard.

  4. On the My Dashboard screen, select Create New Chart.

  5. On the Add New Time Series Chart screen, in Chart Definition complete any of the optional fields:

    Name

    Short description of chart.

    Time Range

    Specifies the period of time that the chart covers. The default is 1 hour. Can be set to hours (1,2,4,8,24) or days (7,30,45).

    Chart Update Rate

    Collects metric data and adds it to the chart at the specified interval. The default is 1 minute. Can be set to minutes (1,5,10,30) or 1 hour.

    Chart Type

    Determines how the data is displayed. The default type is Line. Can be set to the following values:

    • Line

    • Bar

    • Stacked Bar

    • Area

    • Stacked Area

    Chart Size

    This controls the visual display of the chart width as it appears on My Dashboard. The default is Small. This field can be set to Small to display it at 50% or Large for 100%.

  6. On the Add New Time Series Chart screen, in Added Chart Data complete the following fields:

    Metric

    In monasca, a metric is a multi-dimensional description that consists of the following fields: name, dimensions, timestamp, value and value_meta. The pre-populated list is equivalent to using the monasca metric-name-list on the command line.

    Dimension

    The set of unique dimensions that are defined for a specific metric. Dimensions are a dictionary of key-value pairs. This pre-populated list is equivalent to using the monasca metric-list --name <metric_name> on the command line.

    Function

    Operations Console uses monasca to provide the results of all mathematical functions. monasca in turn uses Graphite to perform the mathematical calculations and return the results. The Function field can be set to AVG (default), MIN, MAX, or COUNT. For more information on these functions, see the Graphite documentation at http://www.aosabook.org/en/graphite.html.

  7. Click Add Data To Chart. To add another metric to the chart, repeat the previous step until all metrics are added. The maximum you can have in one chart is 6 metrics.

  8. To create the chart, click Create New Chart.

After you click Create New Chart, you will be returned to My Dashboard where the new chart will be shown. From the My Dashboard screen you can use the menu in the top-right corner of the card to delete or edit the chart. You can also select an option to create a comma-delimited file of the data in the chart.

16.1.5.2 Chart cinder Capacity

To visualize the use of storage capacity over time, you can create a chart that graphs the total block storage backend capacity. To find out how much of that total is being used, you can also create a chart that graphs the available block storage capacity.

Visualizing cinder:

Important

The total and free capacity values are based on the available capacity reported by the cinder backend. Be aware that some backends can be configured to thinly provision.

16.1.5.3 Chart Total Capacity

To chart the total block-storage backend capacity:

  1. Log in to Operations Console.

  2. Follow the steps in the previous instructions to start creating a chart.

  3. To chart the total backend capacity, on the Add New Time Series Chart screen, in Chart Definition use the following settings:

    Field      Setting
    Metrics    cinderlm.cinder.backend.total.size
    Dimension  any hostname. If multiple backends are available, select any one. The backends will all return the same metric data.

  4. Add the data to the chart and click Create.

Example of a cinder Total Capacity Chart:

16.1.5.4 Chart Available Capacity

To chart the available block-storage backend capacity:

  1. Log in to Operations Console.

  2. Follow the steps in the previous instructions to start creating a chart.

  3. To chart the available backend capacity, on the Add New Time Series Chart screen, in Chart Definition use the following settings:

    Field      Setting
    Metrics    cinderlm.cinder.backend.total.avail
    Dimension  any hostname. If multiple backends are available, select any one. The backends will all return the same metric data.

  4. Add the data to the chart and click Create.

Example of a chart showing cinder Available Capacity:

Important

The source data for the Capacity Summary pages is only refreshed at the top of each hour. This affects the latency of the displayed data on those pages.

16.1.6 Getting Help with the Operations Console

On each of the Operations Console pages there is a help menu that you can click on to take you to a help page specific to the console you are currently viewing.

To reach the help page:

  1. Click the help menu option in the upper-right corner of the page, depicted by the question mark seen in the screenshot below.

  2. Click the Get Help For This Page link which will open the help page in a new tab in your browser.

Image

16.2 Alarm Definition

The Alarm Definition section under Monitoring allows you to define alarms that are useful in generating notifications and metrics required by your organization. By default, alarm definitions are sorted by name and in a table format.

16.2.1 Filter and Sort

The search feature allows you to search and filter alarm entries by name and description.

The check box above the top left of the table is used to select all alarm definitions on the current page.

To sort the table, click the desired column header. To reverse the sort order, click the column again.

16.2.2 Create Alarm Definitions

The Create Alarm Definition button next to the search bar allows you to create a new alarm definition.

To create a new alarm definition:

  1. Click Create Alarm Definition to open the Create Alarm Definition dialog.

  2. In the Create Alarm Definition window, type a name for the alarm in the Name text field. The name is mandatory and can be up to 255 characters long. The name can include letters, numbers, and special characters.

  3. Provide a short description of the alarm in the Description text field (optional).

  4. Select the desired severity level of the alarm from the Severity drop-down box. The severity level is subjective, so choose the level appropriate for prioritizing the handling of alarms when they occur.

  5. Specifying how to receive notifications is optional. To do so, notification methods (Email, Web, API, etc.) must be available in the list of options in the Alarm Notifications area. If none are available to choose from, you must first configure them in the Notifications Methods window. Refer to the Notification Methods help page for further instructions.

  6. To enable notifications for the alarm, enable the check box next to the desired alarm notification method.

  7. Apply the following rules to your alarm by using the Alarm Expression form:

    • Function: determines the output value from a supplied input value.

    • Metric: selects the pre-defined measurement that the alarm evaluates.

    • Dimension(s): identifies which aspect (Hostname, Region, and Service) of the alarm you want to monitor.

    • Comparator: specifies the operator for how you want the alarm to trigger.

    • Threshold: determines the numeric threshold associated with the operator you specified.

  8. Match By (optional): group results by a specific dimension that is not part of the Dimension(s) selection.

  9. To save the changes and add the new alarm definition to the table, click Create Alarm Definition.
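The fields of the Alarm Expression form can be read as the parts of a single textual expression. The following is a minimal sketch, assuming the form maps onto an expression of the shape function(metric{dimensions}) comparator threshold. The helper name build_alarm_expression and the example metric, host name, and threshold are illustrative assumptions, not values taken from this guide.

```shell
# Compose an alarm expression string from the Alarm Expression form fields.
# All argument values used below are illustrative assumptions.
build_alarm_expression() (
    func=$1        # Function, for example avg, min, or max
    metric=$2      # Metric name
    dims=$3        # Dimension(s) as comma-separated key=value pairs; may be empty
    comparator=$4  # Comparator, for example > or <=
    threshold=$5   # Numeric threshold
    if [ -n "$dims" ]; then
        printf '%s(%s{%s}) %s %s\n' "$func" "$metric" "$dims" "$comparator" "$threshold"
    else
        printf '%s(%s) %s %s\n' "$func" "$metric" "$comparator" "$threshold"
    fi
)

# prints: avg(cpu.user_perc{hostname=compute1}) > 90
build_alarm_expression avg cpu.user_perc "hostname=compute1" ">" 90
```

Leaving the dimension argument empty yields an expression that applies to every source reporting the metric, which is where the Match By setting becomes useful for grouping results.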

16.3 Alarm Explorer

This page displays the alarms for all services and appliances. By default, alarms are sorted by their state.

16.3.1 Filter and Sort

Using the Filter Alarms button, you can filter the alarms by their IDs and dimensions. The Filter Alarms dialog lets you configure a filtering rule using the Alarm ID field and options in the Dimension(s) section.

You can display the alarms by grid, list or table views by selecting the corresponding icons next to the Sort By control.

To sort the alarm list, click the desired column header. To reverse the sort order, click the column again.

16.3.2 Alarm Table

Each row has a checkbox to allow you to select multiple alarms and set the same condition on them.

The Status column displays a graphical indicator that shows the state of each alarm:

  • Green indicator: OK. Good operating state.

  • Yellow indicator: Warning. Low severity, not requiring immediate action.

  • Red indicator: Alarm. Varying severity levels and must be addressed.

  • Gray indicator: Unknown.

The Alarm column identifies the alarm by the name it was given when it was originally created.

The Last Check column displays the date and time of the most recent occurrence of the alarm.

The Dimension column describes the components to check in order to clear the alarm.

16.3.3 Notification Methods

The Notification Methods section of the Alarm Explorer allows you to define notification methods that are used by the alarms. By default, notification methods are sorted by name.

16.3.3.1 Filter and Sort

The filter bar allows you to filter the notification methods by specifying filter criteria. You can sort the available notification methods by clicking the desired column header in the table.

16.3.3.2 Create Notification Methods

The Create Notification Method button beside the search bar allows you to create a new notification method.

To create a new notification method:

  1. Click the Create Notification Method button.

  2. In the Create Notification Method window, specify a name for the notification in the Name text field. The name is required, and it can be up to 255 characters in length, consisting of letters, numbers, or special characters.

  3. Select the desired option from the Type drop-down list:

    • Web Hook allows you to enter in an internet address, also referred to as a Web Hook.

    • Email allows you to enter an email address. For this method to work, you need to have an SMTP server specified.

    • PagerDuty allows you to enter in a PagerDuty address.

  4. In the Address/Key text field, provide the required values.

  5. Press Create Notification Method, and you should see the created notification method in the table.

16.4 Compute Hosts

The Compute Hosts page in the Compute section allows you to view your Compute Host resources.

16.4.1 Filter and Sort

The dedicated bar at the top of the page lets you filter alarm entries using the available filtering options.

Figure 16.1: Compute Hosts

Click the Filter icon to select one of the available options:

  • Any Column enables plain search across all columns.

  • Status filters alarm entries by status.

  • Type enables filtering by host type, including Hyper-V, KVM, ESXi, and VMware vCenter server.

  • State filters alarm entries by nova state (for example, Activated, Activating, Imported, etc.).

  • Alarm State filters entries by the status of the alarms that are triggered on the host.

  • Cluster returns a filtered list of configured clusters that Compute Hosts belong to.

The alarm entries can be sorted by clicking on the appropriate column header, such as Name, Status, Type, State, etc.

To view detailed information (including alarm counts and utilization metrics) about a specific host, click the host's name in the list.

16.5 Compute Instances

This Operations Console page allows you to monitor your Compute instances.

16.5.1 Search and Sort

The search bar allows you to filter the alarm definitions you want to view. Type and Status are examples of alarm criteria that can be specified. Additionally, you can filter by typing in text similar to searching by keywords.

The checkbox allows you to select (or deselect) a group of alarm definitions to delete:

  • Select Visible allows you to delete the selected alarm definitions from the table.

  • Select All allows you to delete all the alarms from the table.

  • Clear Selection allows you to clear the current selection in the table.

You can display the alarm definitions by grid, list or table views by selecting the corresponding icons next to the Sort By control.

The Sort By control contains a drop-down list of ways by which you can sort the compute nodes. Alternatively, you can also sort using the column headers in the table.

  • Sort by Name displays the compute instances by the names assigned to them when they were created.

  • Sort by State displays the compute instances by their current state.

  • Sort by Status displays the compute instances by their current status.

  • Sort by Host displays the compute instances by their host.

  • Sort by Image displays the compute instances by the image being used.

  • Sort by IP Address displays the compute instances by their IP address.

16.6 Compute Summary

The Compute Summary page in the Compute section gives you access to inventory, capacity, and alarm summaries.

16.6.1 Inventory Summary

The Inventory Summary section provides an overview of compute alarms by status. These alarms are grouped by control plane. There is also information on resource usage for each compute host. Here you can also see alarms triggered on individual compute hosts.

Figure 16.2: Compute Summary

16.6.2 Capacity Summary

Capacity Summary offers an overview of the utilization of physical resources and allocation of virtual resources among compute nodes. Here you will also find a break-down of CPU, memory, and storage usage across all compute resources in the cloud.

16.6.3 Compute Summary

The Compute Summary shows an overview of new alarms as well as a list of all alarms that can be filtered and sorted. For more information on filtering alarms, see Section 16.3, “Alarm Explorer”.

16.6.4 Appliances

This page displays details of an appliance.

Search and Sort

  • The search bar allows you to filter the appliances you want to view. Role and Status are examples of criteria that can be specified. Additionally, you can filter by selecting Any Column and typing in text similar to searching by keywords.

  • You can sort using the column headers in the table.

Actions

Click the Action icon (three dots) to view details of an appliance.

16.6.5 Block Storage Summary

This page displays the alarms that have triggered since the timeframe indicated.

Search and Sort

  • The search bar allows you to filter the alarms you want to view. State and Service are examples of criteria that can be specified. Additionally, you can filter by typing in text similar to searching by keywords.

  • You can sort alarm entries using the column headers in the table.

New Alarms: Block Storage

The New Alarms section shows you the alarms that have triggered since the timeframe indicated. You can select the timeframe using the Configure control with options ranging from the Last Minute to Last 30 Days. This section refreshes every 60 seconds.

The new alarms will be separated into the following categories:

  • Critical - Open alarms, identified by red indicator.

  • Warning - Open alarms, identified by yellow indicator.

  • Unknown - Open alarms, identified by gray indicator. Unknown will be the status of an alarm that has stopped receiving a metric. This can be caused by the following conditions:

    • An alarm exists for a service or component that is not installed in the environment.

    • An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.

    • There is a gap between the last reported metric and the next metric.

  • Open - Complete list of open alarms.

  • Total - Complete list of alarms, may include Acknowledged and Resolved alarms.

16.7 Logging

This page displays the link to the Logging Interface, known as Kibana.

Important: Accessing Kibana

The Kibana logging interface only runs on the management network. You need to have access to that network to be able to use Kibana.

16.7.1 View Logging Interface

To access the logging interface, click the Launch Logging Interface button, which will open the interface in a new window.

For more details about the logging interface, see Section 13.2, “Centralized Logging Service”.

16.8 My Dashboard

This page allows you to customize the dashboard by mixing and matching graphs and alarm cards.

Because different operators may be interested in different metrics and alarms, the configuration for this page is tied to the login account used to access Operations Console. Charts available here are based on metrics collected by the monasca monitoring component.

16.9 Networking Alarm Summary

This page displays the alarms for the Networking (neutron), DNS, Firewall, and Load Balancing services. By default, alarms are sorted by State.

16.9.1 Filter and Sort

The filter bar allows you to filter the alarms by the available criteria, including Dimension, State, and Service. The dimension filter accepts key/value pairs, while the State filter provides a selection of valid values.

You can sort alarm entries using the column headers in the table.

16.9.2 Alarm Table

You can select one or multiple alarms using the check box next to each entry.

The State column displays a graphical indicator that shows the state of each alarm:

  • Green indicator: OK. Good operating state.

  • Yellow indicator: Warning. Low severity, not requiring immediate action.

  • Red indicator: Alarm. Varying severity levels and must be addressed.

  • Gray square (or gray indicator): Undetermined.

The Alarm column identifies the alarm by its name.

The Last Check column displays the date and time of the most recent occurrence of the alarm.

The Dimension column shows the components to check in order to clear the alarm.

The last column, depicted by three dots, opens an Actions menu that gives you access to the following options:

  • View Details opens a separate window with the information from the table view and the alarm history.

  • View Alarm Definition allows you to view and edit the selected alarm definition.

  • Delete is used to delete the currently selected alarm entry.

16.10 Central Dashboard

This page displays a high level overview of all cloud resources and their alarm status.

16.10.1 Central Dashboard

Image

16.10.2 New Alarms

The New Alarms section shows you the alarms that have triggered since the timeframe indicated. You can select the timeframe using the View control with options ranging from the Last Minute to Last 30 Days. This section refreshes every 60 seconds.

The new alarms will be separated into the following categories:

  • Critical - Open alarms, identified by red indicator.

  • Warning - Open alarms, identified by yellow indicator.

  • Unknown - Open alarms, identified by gray indicator. Unknown will be the status of an alarm that has stopped receiving a metric. This can be caused by the following conditions:

    • An alarm exists for a service or component that is not installed in the environment.

    • An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.

    • There is a gap between the last reported metric and the next metric.

  • Open - Complete list of open alarms.

  • Total - Complete list of alarms, may include Acknowledged and Resolved alarms.

16.10.3 Alarm Summary

Each service or group of services has a dedicated card displaying related alarms.

  • Critical - Open alarms, identified by red indicator.

  • Warning - Open alarms, identified by yellow indicator.

  • Unknown - Open alarms, identified by gray indicator. Unknown will be the status of an alarm that has stopped receiving a metric. This can be caused by the following conditions:

    • An alarm exists for a service or component that is not installed in the environment.

    • An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.

    • There is a gap between the last reported metric and the next metric.

  • Open - Complete list of open alarms.

  • Total - Complete list of alarms, may include Acknowledged and Resolved alarms.

17 Backup and Restore

The following sections cover backup and restore operations. Before installing your cloud, there are several things you must do so that you achieve the backup and recovery results you need. SUSE OpenStack Cloud comes with playbooks and procedures to recover the control plane from various disaster scenarios.

As of SUSE OpenStack Cloud 9, Freezer (a distributed backup restore and disaster recovery service) is no longer supported; backup and restore are manual operations.

Consider Section 17.2, “Enabling Backups to a Remote Server” in case you lose cloud servers that back up and restore services.

The following features are supported:

  • File system backup using a point-in-time snapshot.

  • Strong encryption: AES-256-CFB.

  • MariaDB database backup with LVM snapshot.

  • Restoring your data from a previous backup.

  • Low storage requirement: backups are stored as compressed files.

  • Flexible backup (both incremental and differential).

  • Data is archived in GNU Tar format for file-based incremental backup and restore.

  • When a key is provided, OpenSSL is used to encrypt data (AES-256-CFB).

17.1 Manual Backup Overview

This section covers manual backup and some restore processes. Full documentation for restore operations is in Section 15.2, “Unplanned System Maintenance”. To back up outside the cluster, refer to Section 17.2, “Enabling Backups to a Remote Server”. Backups of the following types of resources are covered:

  • Cloud Lifecycle Manager Data.  All important information on the Cloud Lifecycle Manager.

  • MariaDB database that is part of the Control Plane.  The MariaDB database contains most of the data needed to restore services. MariaDB supports full backup and recovery for all services. Logging data in Elasticsearch is not backed up. swift objects are not backed up because of the redundant nature of swift.

  • swift Rings used in the swift storage deployment.  swift rings are backed up so that you can recover more quickly than rebuilding with swift. swift can rebuild the rings without this backup data, but automatically rebuilding the rings is slower than restoring from a backup.

  • Audit Logs.  Audit Logs are backed up to provide retrospective information and statistical data for performance and security purposes.

The following services will be backed up. Specifically, the data needed to restore the services is backed up. This includes databases and configuration-related files.

Important

Data content for some services is not backed up, as indicated below.

  • ceilometer. There is no backup of metrics data.

  • cinder. There is no backup of the volumes.

  • glance. There is no backup of the images.

  • heat

  • horizon

  • keystone

  • neutron

  • nova. There is no backup of the images.

  • swift. There is no backup of the objects. swift has its own high availability and redundancy. swift rings are backed up. Although swift can rebuild the rings itself, restoring from backup is faster.

  • Operations Console

  • monasca. There is no backup of the metrics.

17.2 Enabling Backups to a Remote Server

We recommend that you set up a remote server to store your backups, so that you can restore the control plane nodes. This may be necessary if you lose all of your control plane nodes at the same time.

Important

A remote backup server must be set up before proceeding.

You do not have to restore from the remote server if only one or two control plane nodes are lost. In that case, the control plane can be recovered from the data on a remaining control plane node following the restore procedures in Section 15.2.3.2, “Recovering the Control Plane”.

17.2.1 Securing your SSH backup server

You can do the following to harden an SSH server:

  • Disable root login

  • Move SSH to a non-default port (the default SSH port is 22)

  • Disable password login (only allow RSA keys)

  • Disable SSH v1

  • Authorize Secure File Transfer Protocol (SFTP) only for the designated backup user (disable SSH shell)

  • Firewall SSH traffic to ensure it comes from the SUSE OpenStack Cloud address range

  • Install a Fail2Ban solution

  • Restrict users who are allowed to SSH

  • Additional suggestions are available online

Remove the key pair generated earlier on the backup server; the only thing needed is .ssh/authorized_keys. You can remove the .ssh/id_rsa and .ssh/id_rsa.pub files. Be sure to save a backup of them.
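Some of the hardening items above correspond to directives in the SSH server configuration file. The following is a minimal sketch of a check for two of them, assuming an OpenSSH-style configuration. The helper name check_sshd_hardening and the sample configuration (including port 2222) are hypothetical; the example reads a scratch file and does not touch the real server configuration.

```shell
# Report whether selected hardening directives are set in an sshd_config-style file.
check_sshd_hardening() (
    conf=$1
    status=0
    for setting in "PermitRootLogin no" "PasswordAuthentication no"; do
        key=${setting% *}      # directive name
        value=${setting#* }    # expected value
        # Match the directive at the start of a line, ignoring case and extra spaces.
        if grep -Eiq "^[[:space:]]*${key}[[:space:]]+${value}[[:space:]]*$" "$conf"; then
            echo "ok: $setting"
        else
            echo "missing: $setting"
            status=1
        fi
    done
    exit "$status"
)

# Hypothetical sample configuration used only for this demonstration.
sample=$(mktemp)
printf '%s\n' 'Port 2222' 'PermitRootLogin no' 'PasswordAuthentication yes' > "$sample"
report=$(check_sshd_hardening "$sample") || true
echo "$report"
rm -f "$sample"
```

The function returns a non-zero status when any expected directive is missing, so it can be used as a gate in a larger script.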

17.2.2 General tips

  • Provide adequate space in the directory that is used for backup.

  • Monitor the space left on that directory.

  • Keep the system up to date on that server.
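Monitoring the space left in the backup directory can be as simple as a periodic check of file system usage. The sketch below is one possible approach, not a prescribed tool; the helper names, the 90 percent limit, and the use of /tmp as a stand-in for the backup directory are assumptions.

```shell
# Print the percentage of space used on the file system that holds a directory.
backup_dir_usage_pct() (
    dir=$1
    # df -P prints one header line and one data line; column 5 is the capacity, such as 42%.
    df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
)

# Print a warning when usage reaches the given limit (in percent).
warn_if_backup_dir_full() (
    dir=$1
    limit=$2
    used=$(backup_dir_usage_pct "$dir")
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $dir is ${used}% full (limit ${limit}%)"
    else
        echo "OK: $dir is ${used}% full"
    fi
)

# /tmp stands in for the backup directory on the remote server.
warn_if_backup_dir_full /tmp 90
```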

17.3 Manual Backup and Restore Procedures

Each backup requires the following steps:

  1. Create a snapshot.

  2. Mount the snapshot.

  3. Generate a TAR archive and save it.

  4. Unmount and delete the snapshot.

17.3.1 Cloud Lifecycle Manager Data Backup

The following procedure is used for each of the five BACKUP_TARGETS (listed below). Incremental backup instructions follow the full backup procedure. For both full and incremental backups, the last step of the procedure is to unmount and delete the snapshot after the TAR archive has been created and saved. A new snapshot must be created every time a backup is created.

Procedure 17.1: Manual Backup Setup
  1. Create a snapshot on the Cloud Lifecycle Manager in ardana-vg, the location where all Cloud Lifecycle Manager data is stored.

    ardana > sudo lvcreate --size 2G --snapshot --permission r \
    --name lvm_clm_snapshot /dev/ardana-vg/root

    Note

    If you have stored additional data or files in ardana-vg, you may need more space than the 2G indicated for the size parameter. In this situation, create a preliminary TAR archive with the tar command on the directory before creating a snapshot. Set the snapshot size parameter to a value larger than the size of the archive.

  2. Mount the snapshot

    ardana > sudo mkdir /var/tmp/clm_snapshot
    ardana > sudo mount -o ro /dev/ardana-vg/lvm_clm_snapshot /var/tmp/clm_snapshot
  3. Generate a TAR archive (does not apply to incremental backups) with an appropriate BACKUP_TAR_ARCHIVE_NAME.tar.gz backup file for each of the following BACKUP_TARGETS.

    Backup Targets

    • home

    • ssh

    • shadow

    • passwd

    • group

    The backup TAR archive should contain only the necessary data; nothing extra. Some of the archives will be stored as directories, others as files. The backup commands are slightly different for each type.

    If the BACKUP_TARGET is a directory, append its path to /var/tmp/clm_snapshot and archive the directory contents (.). If the BACKUP_TARGET is a file, append the path of its parent directory to /var/tmp/clm_snapshot and name the file itself as the final argument.

    In the commands that follow, replace BACKUP_TARGET with the appropriate BACKUP_PATH (replacement table is below).

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek \
    --ignore-failed-read --file BACKUP_TAR_ARCHIVE_NAME.tar.gz -C \
    /var/tmp/clm_snapshotTARGET_DIR|BACKUP_TARGET_WITHOUT_LEADING_DIR
    • If BACKUP_TARGET is a directory, replace TARGET_DIR with BACKUP_PATH.

      For example, where BACKUP_PATH=/etc/ssh/ (a directory):

      ardana > sudo tar --create -z --warning=none --no-check-device \
      --one-file-system --preserve-permissions --same-owner --seek \
      --ignore-failed-read --file ssh.tar.gz -C /var/tmp/clm_snapshot/etc/ssh .
    • If BACKUP_TARGET is a file (not a directory), replace TARGET_DIR with the parent directory of BACKUP_PATH.

      For example, where BACKUP_PATH=/etc/passwd (a file):

      ardana > sudo tar --create -z --warning=none --no-check-device \
      --one-file-system --preserve-permissions --same-owner --seek \
      --ignore-failed-read --file passwd.tar.gz -C /var/tmp/clm_snapshot/etc passwd
  4. Save the TAR archive to the remote server.

    ardana > scp TAR_ARCHIVE USER@REMOTE_SERVER:
  5. Use the following commands to unmount and delete a snapshot.

    ardana > sudo umount -l -f /var/tmp/clm_snapshot; sudo rm -rf /var/tmp/clm_snapshot
    ardana > sudo lvremove -f /dev/ardana-vg/lvm_clm_snapshot

The table below shows Cloud Lifecycle Manager backup_targets and their respective backup_paths.

Table 17.1: Cloud Lifecycle Manager Backup Paths

backup_name        backup_path
home_backup        /var/lib/ardana (file)
etc_ssh_backup     /etc/ssh/ (directory)
shadow_backup      /etc/shadow (file)
passwd_backup      /etc/passwd (file)
group_backup       /etc/group (file)
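The difference between directory and file targets comes down to which directory tar changes into and which member it archives. The following sketch only prints the resulting tar command and does not run it; the helper name print_clm_tar_command and its kind argument are hypothetical, and the snapshot is assumed to be mounted at /var/tmp/clm_snapshot as in Procedure 17.1.

```shell
# Print (do not run) the tar command for one Cloud Lifecycle Manager backup target.
# kind is "directory" or "file"; archive is the name of the TAR archive to create.
print_clm_tar_command() (
    backup_path=$1
    kind=$2
    archive=$3
    snapshot_mount=/var/tmp/clm_snapshot
    path=${backup_path%/}               # drop a trailing slash, if any
    if [ "$kind" = directory ]; then
        target_dir=$path                # change into the directory itself
        member=.                        # and archive its contents
    else
        target_dir=$(dirname "$path")   # change into the parent directory
        member=$(basename "$path")      # and archive only the file
    fi
    echo "sudo tar --create -z --warning=none --no-check-device" \
         "--one-file-system --preserve-permissions --same-owner --seek" \
         "--ignore-failed-read --file $archive" \
         "-C ${snapshot_mount}${target_dir} $member"
)

print_clm_tar_command /etc/ssh/ directory ssh.tar.gz
print_clm_tar_command /etc/passwd file passwd.tar.gz
```

Printing the command first makes it easy to review the -C directory and the final member argument before running the backup with sudo.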

17.3.1.1 Cloud Lifecycle Manager Incremental Backup

Incremental backups require a meta file. If you use the incremental backup option, a meta file must be included in the tar command in the initial backup and whenever you do an incremental backup. A copy of the original meta file should be stored in each backup. The meta file is used to determine the incremental changes from the previous backup, so it is rewritten with each incremental backup.

Versions are useful for incremental backup because they provide a way to differentiate between each backup. Versions are included in the tar command.

Every incremental backup requires creating and mounting a separate snapshot. After the TAR archive is created, the snapshot is unmounted and deleted.

To prepare for incremental backup, follow the steps in Procedure 17.1, “Manual Backup Setup” with the following differences in the commands for generating a tar archive.

  • First time full backup

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek \
    --ignore-failed-read --listed-incremental=PATH_TO_YOUR_META \
    --file BACKUP_TAR_ARCHIVE_NAME.tar.gz -C \
    /var/tmp/clm_snapshotTARGET_DIR|BACKUP_TARGET_WITHOUT_LEADING_DIR

    For example, where BACKUP_PATH=/etc/ssh/:

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --listed-incremental=mysshMeta --file ssh.tar.gz -C \
    /var/tmp/clm_snapshot/etc/ssh .
  • Incremental backup

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek \
    --ignore-failed-read --listed-incremental=PATH_TO_YOUR_META \
    --file BACKUP_TAR_ARCHIVE_NAME_VERSION.tar.gz -C \
    /var/tmp/clm_snapshotTARGET_DIR|BACKUP_TARGET_WITHOUT_LEADING_DIR

    For example, where BACKUP_PATH=/etc/ssh/:

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --listed-incremental=mysshMeta --file \
    ssh_v1.tar.gz -C \
    /var/tmp/clm_snapshot/etc/ssh .

After creating an incremental backup, use the following commands to unmount and delete a snapshot.

ardana > sudo umount -l -f /var/tmp/clm_snapshot; sudo rm -rf /var/tmp/clm_snapshot
ardana > sudo lvremove -f /dev/ardana-vg/lvm_clm_snapshot
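The role of the meta file can be observed safely in a scratch directory. The following sketch assumes GNU tar and uses hypothetical file names; it omits the snapshot-related options and sudo because it works on temporary data only. It takes a full backup, changes the data, takes an incremental backup against the same meta file, and lists what the incremental archive contains.

```shell
# Demonstrate --listed-incremental in a scratch directory (GNU tar assumed).
work=$(mktemp -d)
mkdir "$work/data"
echo "first" > "$work/data/a.txt"

# Full backup: tar creates the meta file because it does not exist yet.
tar --create -z --listed-incremental="$work/meta" \
    --file "$work/data.tar.gz" -C "$work/data" .

# Keep a copy of the meta file as it was after the full backup.
cp "$work/meta" "$work/meta.level0"

# Change the data, then take an incremental backup against the same meta file.
sleep 1
echo "second" > "$work/data/b.txt"
tar --create -z --listed-incremental="$work/meta" \
    --file "$work/data_v1.tar.gz" -C "$work/data" .

# The incremental archive lists the directory and the new file, not the unchanged one.
inc_listing=$(tar --list -z --file "$work/data_v1.tar.gz")
echo "$inc_listing"
rm -rf "$work"
```

Because tar rewrites the meta file on every run, the saved copy (meta.level0 here) is what allows another backup relative to the original full backup later.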

17.3.1.2 Encryption

When a key is provided, OpenSSL is used to encrypt data (AES-256-CFB). Backup files can be encrypted with the following command:

ardana > sudo openssl enc -aes-256-cfb -pass file:ENCRYPT_PASS_FILE_PATH -in \
YOUR_BACKUP_TAR_ARCHIVE_NAME.tar.gz -out YOUR_BACKUP_TAR_ARCHIVE_NAME.tar.gz.enc

For example, using the ssh.tar.gz generated above:

ardana > sudo openssl enc  -aes-256-cfb -pass file:myEncFile -in ssh.tar.gz  -out ssh.tar.gz.enc
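Before relying on an encrypted backup, it is worth confirming that it can be decrypted again. The following is a minimal round-trip sketch on scratch data, assuming the openssl command is installed. The file contents and the passphrase are placeholders, and decryption uses the standard -d option of openssl enc with the same cipher and passphrase file.

```shell
# Encrypt and decrypt a scratch file to confirm that the round trip works.
work=$(mktemp -d)
echo "example backup payload" > "$work/ssh.tar.gz"   # stand-in for a real archive
echo "example-passphrase" > "$work/myEncFile"        # hypothetical passphrase file

# Encrypt with the same options as in the procedure above.
openssl enc -aes-256-cfb -pass file:"$work/myEncFile" \
    -in "$work/ssh.tar.gz" -out "$work/ssh.tar.gz.enc"

# Decrypt with -d, the same cipher, and the same passphrase file.
openssl enc -d -aes-256-cfb -pass file:"$work/myEncFile" \
    -in "$work/ssh.tar.gz.enc" -out "$work/ssh.restored.tar.gz"

# Compare the decrypted file with the original.
if cmp -s "$work/ssh.tar.gz" "$work/ssh.restored.tar.gz"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
rm -rf "$work"
```

Recent OpenSSL releases may print a warning about the key derivation method on standard error; the sketch keeps the options used in this section rather than changing them.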

17.3.2 MariaDB Database Backup

When backing up MariaDB, the following process must be performed on all nodes in the cluster. It is similar to the backup procedure above for the Cloud Lifecycle Manager (see Procedure 17.1, “Manual Backup Setup”). The difference is the addition of SQL commands, which are run with the create_db_snapshot.yml playbook.

Create the create_db_snapshot.yml file in ~/scratch/ansible/next/ardana/ansible/ on the deployer with the following content:

- hosts: FND-MDB
  vars:
    # Name of the LVM snapshot to create and the volume that holds the database
    snapshot_name: lvm_mysql_snapshot
    lvm_target: /dev/ardana-vg/mysql

  tasks:
    # Remove a leftover snapshot from an earlier run; errors are ignored if none exists
    - name: Cleanup old snapshots
      become: yes
      shell: |
        lvremove -f /dev/ardana-vg/{{ snapshot_name }}
      ignore_errors: True

    - name: Create snapshot
      become: yes
      shell: |
        lvcreate --size 2G --snapshot --permission r --name {{ snapshot_name }} {{ lvm_target }}
      register: snapshot_st
      ignore_errors: True

    # Stop the play if lvcreate did not succeed
    - fail:
        msg: "Failed to create snapshot on {{ lvm_target }}"
      when: snapshot_st.rc != 0

Note

Verify the validity of the lvm_target variable (which refers to the actual database LVM volume) before proceeding with the backup.
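One way to verify the lvm_target value on a database node is to test whether the path exists as a block device. The sketch below is an assumption about how such a check could look, not a step from the product tooling; the helper name check_lvm_target is hypothetical, and the path is the default from the playbook above.

```shell
# Return success only if the given path exists and is a block device.
check_lvm_target() (
    target=$1
    if [ -b "$target" ]; then
        echo "found block device: $target"
    else
        echo "not a block device: $target" >&2
        exit 1
    fi
)

# On a machine without this volume, the fallback message is printed instead.
check_lvm_target /dev/ardana-vg/mysql || echo "adjust lvm_target before running the playbook"
```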

Doing the MariaDB backup

  1. We recommend storing the MariaDB version with your backup. The following command saves the MariaDB version as MARIADB_VER.

    mysql -V | grep -Eo '(\S+?)-MariaDB' > MARIADB_VER
  2. Open a MariaDB client session on all controllers.

  3. On all controllers, run the following command to acquire a global read lock, and keep the MariaDB session open.

    >> FLUSH TABLES WITH READ LOCK;
  4. Open a new terminal and run the create_db_snapshot.yml playbook created above.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts create_db_snapshot.yml
  5. Go back to the open MariaDB session on each controller and run the following command to release the lock.

    >> UNLOCK TABLES;
  6. Mount the snapshot

    dbnode>> mkdir /var/tmp/mysql_snapshot
    dbnode>> sudo mount -o ro /dev/ardana-vg/lvm_mysql_snapshot  /var/tmp/mysql_snapshot
  7. On each database node, generate a TAR archive with an appropriate BACKUP_TAR_ARCHIVE_NAME.tar.gz backup file for the BACKUP_TARGET.

    The backup_name is mysql_backup and the backup_path (BACKUP_TARGET) is /var/lib/mysql/.

    dbnode>> sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --file mydb.tar.gz -C /var/tmp/mysql_snapshot/var/lib/mysql .
  8. Unmount and delete the MariaDB snapshot on each database node.

    dbnode>> sudo  umount -l -f /var/tmp/mysql_snapshot; \
    sudo rm -rf /var/tmp/mysql_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_mysql_snapshot

17.3.2.1 Incremental MariaDB Database Backup

Incremental backups require a meta file. If you use the incremental backup option, a meta file must be included in the tar command in the initial backup and whenever you do an incremental backup. A copy of the original meta file should be stored in each backup. The meta file is used to determine the incremental changes from the previous backup, so it is rewritten with each incremental backup.

Versions are useful for incremental backup because they provide a way to differentiate between each backup. Versions are included in the tar command.

To prepare for incremental backup, follow the steps in the previous section except for the tar commands. Incremental backup tar commands must have additional information.

  • First time MariaDB database full backup

    dbnode>> sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek \
    --ignore-failed-read --listed-incremental=PATH_TO_YOUR_DB_META \
    --file mydb.tar.gz -C /var/tmp/mysql_snapshot/var/lib/mysql .

    For example, where BACKUP_PATH=/var/lib/mysql/:

    dbnode>> sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --listed-incremental=mydbMeta --file mydb.tar.gz -C \
    /var/tmp/mysql_snapshot/var/lib/mysql .
  • Incremental MariaDB database backup

    dbnode>> sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek \
    --ignore-failed-read --listed-incremental=PATH_TO_YOUR_DB_META \
    --file BACKUP_TAR_ARCHIVE_NAME_VERSION.tar.gz -C \
    /var/tmp/mysql_snapshot/var/lib/mysql .

    For example, where BACKUP_PATH=/var/lib/mysql/:

    dbnode>> sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --listed-incremental=mydbMeta --file \
    mydb_v1.tar.gz -C /var/tmp/mysql_snapshot/var/lib/mysql .

After creating and saving the TAR archive, unmount and delete the snapshot.

dbnode>> sudo  umount -l -f /var/tmp/mysql_snapshot; \
sudo rm -rf /var/tmp/mysql_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_mysql_snapshot
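Version suffixes such as mydb_v1.tar.gz keep successive incremental archives apart. The following sketch shows one possible naming helper; the function name next_archive_name and the _vN scheme extended beyond v1 are assumptions based on the example names in this section.

```shell
# Print the next unused versioned archive name, for example mydb_v1.tar.gz, mydb_v2.tar.gz.
next_archive_name() (
    dir=$1    # directory that holds the existing archives
    base=$2   # archive base name without version and extension
    n=1
    while [ -e "$dir/${base}_v${n}.tar.gz" ]; do
        n=$((n + 1))
    done
    echo "${base}_v${n}.tar.gz"
)

# Demonstration in a scratch directory that already holds a full and one incremental backup.
work=$(mktemp -d)
touch "$work/mydb.tar.gz" "$work/mydb_v1.tar.gz"
next_name=$(next_archive_name "$work" mydb)
echo "$next_name"   # prints mydb_v2.tar.gz
rm -rf "$work"
```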

17.3.2.2 MariaDB Database Encryption

  1. Encrypt your MariaDB database backup following the instructions in Section 17.3.1.2, “Encryption”.

  2. Upload your BACKUP_TARGET.tar.gz to your preferred remote server.

17.3.3 swift Ring Backup

The following procedure is used to back up swift rings. It is similar to the Cloud Lifecycle Manager backup (see Procedure 17.1, “Manual Backup Setup”).

Important

The steps must be performed only on the building server. For more information, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”.

The backup_name is swift_builder_dir_backup and the backup_path is /etc/swiftlm/.

  1. Create a snapshot

    ardana > sudo lvcreate --size 2G --snapshot --permission r \
    --name lvm_root_snapshot /dev/ardana-vg/root
  2. Mount the snapshot

    ardana > mkdir /var/tmp/root_snapshot; sudo mount -o ro \
    /dev/ardana-vg/lvm_root_snapshot /var/tmp/root_snapshot
  3. Create the TAR archive

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --file swring.tar.gz -C /var/tmp/root_snapshot/etc/swiftlm .
  4. Upload your swring.tar.gz TAR archive to your preferred remote server.

  5. Unmount and delete the snapshot

    ardana > sudo umount -l -f /var/tmp/root_snapshot; sudo rm -rf \
    /var/tmp/root_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_root_snapshot

17.3.4 Audit Log Backup and Restore

17.3.4.1 Audit Log Backup

The following procedure is used to back up Audit Logs. It is similar to the Cloud Lifecycle Manager backup (see Procedure 17.1, “Manual Backup Setup”). The steps must be performed on all nodes; there will be a backup TAR archive for each node. Before performing the following steps, run through Section 13.2.7.2, “Enable Audit Logging” .

The backup_name is audit_log_backup and the backup_path is /var/audit.

  1. Create a snapshot

    ardana > sudo lvcreate --size 2G --snapshot --permission r --name \
    lvm_root_snapshot /dev/ardana-vg/root
  2. Mount the snapshot

    ardana > mkdir /var/tmp/root_snapshot; sudo mount -o ro \
    /dev/ardana-vg/lvm_root_snapshot /var/tmp/root_snapshot
  3. Create the TAR archive

    ardana > sudo tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
    --file audit.tar.gz -C /var/tmp/root_snapshot/var/audit .
  4. Upload your audit.tar.gz TAR archive to your preferred remote server.

  5. Unmount and delete the snapshot

    ardana > sudo umount -l -f /var/tmp/root_snapshot; sudo rm -rf \
    /var/tmp/root_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_root_snapshot

17.3.4.2 Audit Logs Restore

Restore the Audit Logs backup with the following steps:

  1. Retrieve the Audit Logs TAR archive

  2. Extract the TAR archive to the proper backup location

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /var/audit/  -f audit.tar.gz
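
Because the extraction uses --overwrite, it can be worth confirming that the retrieved archive is complete before extracting it. The following is a minimal sketch, assuming a gzip-compressed TAR archive as created in Section 17.3.4.1; the function name check_archive is hypothetical.

```shell
#!/bin/bash
# Sketch only: check_archive is a hypothetical helper, not product tooling.

check_archive() {
  # $1: path to a gzip-compressed TAR archive
  local archive="$1"
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive" >&2
    return 1
  fi
  # Listing reads the whole archive and fails on truncation or corruption.
  if ! tar -z --list -f "$archive" > /dev/null 2>&1; then
    echo "unreadable: $archive" >&2
    return 1
  fi
  echo "ok: $archive"
}
```

For example, run check_archive audit.tar.gz and continue with the extraction only if it reports ok.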

17.4 Full Disaster Recovery Test

17.4.1 High Level View of the Recovery Process

  1. Back up the control plane using the manual backup procedure

  2. Back up the Cassandra Database

  3. Re-install Controller 1 with the SUSE OpenStack Cloud ISO

  4. Use manual restore steps to recover deployment data (and model)

  5. Re-install SUSE OpenStack Cloud on Controllers 1, 2, 3

  6. Recover the backup of the MariaDB database

  7. Recover the Cassandra Database

  8. Verify testing

17.4.2 Description of the testing environment

The testing environment is similar to the Entry Scale model.

It uses five servers: three Control Nodes and two Compute Nodes.

Each controller node has three disks. The first is reserved for the system; the other two are used for swift.

Note

For this Disaster Recovery test, data has been saved on disks 2 and 3 of the swift controllers, which allows swift objects to be restored after the recovery. If these disks were also wiped, swift data would be lost, but the procedure would not change. The only difference is that glance images would be lost and would have to be uploaded again.

Unless specified otherwise, all commands should be executed on controller 1, which is also the deployer node.

17.4.3 Pre-Disaster testing

In order to validate the procedure after recovery, we need to create some workloads.

  1. Source the service credential file

    ardana > source ~/service.osrc
  2. Copy an image to the platform and create a glance image with it. In this example, Cirros is used

    ardana > openstack image create --disk-format raw --container-format \
    bare --public --file ~/cirros-0.3.5-x86_64-disk.img cirros
  3. Create a network

    ardana > openstack network create test_net
  4. Create a subnet

    ardana > openstack subnet create --network 07c35d11-13f9-41d4-8289-fa92147b1d44 --subnet-range 192.168.42.0/24 test_subnet
  5. Create some instances

    ardana > openstack server create server_1 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    ardana > openstack server create server_2 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
    ardana > openstack server create server_3 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
    ardana > openstack server create server_4 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
    ardana > openstack server create server_5 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
    ardana > openstack server list
  6. Create containers and objects

    ardana > openstack object create container_1 ~/service.osrc
    var/lib/ardana/service.osrc
    
    ardana > openstack object create container_1 ~/backup.osrc
    var/lib/ardana/backup.osrc
    
    ardana > openstack object list container_1
    var/lib/ardana/backup.osrc
    var/lib/ardana/service.osrc
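
To make the later validation less manual, the names of the resources created above can be recorded before the disaster and compared with a new listing after the recovery. The following is a minimal sketch; the function name compare_lists and the file names in the comments are hypothetical. The listings are assumed to come from the openstack client with the -f value -c Name output options.

```shell
#!/bin/bash
# Sketch only: compare_lists is a hypothetical helper.

compare_lists() {
  # $1: file with resource names recorded before the disaster
  # $2: file with resource names recorded after the recovery
  # Prints names that are missing after recovery; returns 1 if any are missing.
  local missing
  missing=$(grep -Fxv -f "$2" "$1" || true)
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
}

# Recording a baseline (requires sourced credentials), for example:
#   openstack server list -f value -c Name > ~/servers.before
#   openstack network list -f value -c Name > ~/networks.before
```

After the recovery, write the same listings to a second set of files and call, for example, compare_lists ~/servers.before ~/servers.after.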

17.4.4 Preparation of the test backup server

17.4.4.1 Preparation to store backups

In this example, backups are stored on the server 192.168.69.132.

  1. Connect to the backup server

  2. Create the user

    root # useradd BACKUPUSER --create-home --home-dir /mnt/backups/
  3. Switch to that user

    root # su BACKUPUSER
  4. Create the SSH keypair

    backupuser > ssh-keygen -t rsa
    > # Leave the default for the first question and do not set any passphrase
    > Generating public/private rsa key pair.
    > Enter file in which to save the key (/mnt/backups//.ssh/id_rsa):
    > Created directory '/mnt/backups//.ssh'.
    > Enter passphrase (empty for no passphrase):
    > Enter same passphrase again:
    > Your identification has been saved in /mnt/backups//.ssh/id_rsa
    > Your public key has been saved in /mnt/backups//.ssh/id_rsa.pub
    > The key fingerprint is:
    > a9:08:ae:ee:3c:57:62:31:d2:52:77:a7:4e:37:d1:28 backupuser@padawan-ccp-c0-m1-mgmt
    > The key's randomart image is:
    > +---[RSA 2048]----+
    > |          o      |
    > |   . . E + .     |
    > |  o . . + .      |
    > | o +   o +       |
    > |  + o o S .      |
    > | . + o o         |
    > |  o + .          |
    > |.o .             |
    > |++o              |
    > +-----------------+
  5. Add the public key to the list of the keys authorized to connect to that user on this server

    backupuser > cat /mnt/backups/.ssh/id_rsa.pub >> /mnt/backups/.ssh/authorized_keys
  6. Print the private key. This will be used for the backup configuration (ssh_credentials.yml file)

    backupuser > cat /mnt/backups/.ssh/id_rsa
    
    > -----BEGIN RSA PRIVATE KEY-----
    > MIIEogIBAAKCAQEAvjwKu6f940IVGHpUj3ffl3eKXACgVr3L5s9UJnb15+zV3K5L
    > BZuor8MLvwtskSkgdXNrpPZhNCsWSkryJff5I335Jhr/e5o03Yy+RqIMrJAIa0X5
    > ...
    > ...
    > ...
    > iBKVKGPhOnn4ve3dDqy3q7fS5sivTqCrpaYtByJmPrcJNjb2K7VMLNvgLamK/AbL
    > qpSTZjicKZCCl+J2+8lrKAaDWqWtIjSUs29kCL78QmaPOgEvfsw=
    > -----END RSA PRIVATE KEY-----

17.4.4.2 Preparation to store Cassandra backups

In this example, backups will be stored on the server 192.168.69.132, in the /mnt/backups/cassandra_backups/ directory.

  1. Create a directory on the backup server to store Cassandra backups.

    backupuser > mkdir /mnt/backups/cassandra_backups
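  2. Copy the private SSH key from the backup server to all controller nodes.

    backupuser > scp /mnt/backups/.ssh/id_rsa ardana@CONTROLLER:~/.ssh/id_rsa_backup

    Replace CONTROLLER with each controller node, for example doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, and so on.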

  3. Log in to each controller node and copy the private SSH key to .ssh directory of the root user.

    ardana >  sudo cp /var/lib/ardana/.ssh/id_rsa_backup /root/.ssh/
  4. Verify that you can SSH to the backup server as backupuser using the private key.

    root # ssh -i ~/.ssh/id_rsa_backup backupuser@192.168.69.132
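
If the SSH test fails, one common cause is file permissions: OpenSSH refuses to use a private key file that is accessible by other users. The following is a minimal sketch for checking this; the function name key_mode_ok is hypothetical, and accepting only modes 600 and 400 is a simplification chosen for this example. It relies on the GNU stat option -c.

```shell
#!/bin/bash
# Sketch only: key_mode_ok is a hypothetical helper.

key_mode_ok() {
  # $1: path to a private key file
  local mode
  mode=$(stat -c %a "$1") || return 2
  case "$mode" in
    600|400)
      return 0
      ;;
    *)
      echo "$1 has mode $mode, expected 600 or 400" >&2
      return 1
      ;;
  esac
}
```

For example, run key_mode_ok /root/.ssh/id_rsa_backup as root on each controller node, and correct the mode with chmod 600 if the check fails.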

17.4.5 Perform Backups for disaster recovery test

17.4.5.1 Execute backup of Cassandra

Create the following cassandra-backup-extserver.sh script on all controller nodes.

root # cat > ~/cassandra-backup-extserver.sh << EOF
#!/bin/sh

# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/

# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool

# example: cassandra-snp-2018-06-26-1003
SNAPSHOT_NAME=cassandra-snp-\$(date +%F-%H%M)
HOST_NAME=\$(/bin/hostname)_

# Take a snapshot of Cassandra database
\$NODETOOL snapshot -t \$SNAPSHOT_NAME monasca

# Collect a list of directories that make up the snapshot
SNAPSHOT_DIR_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
for d in \$SNAPSHOT_DIR_LIST
  do
    # copy snapshot directories to external server
    rsync -avR -e "ssh -i /root/.ssh/id_rsa_backup" \$d \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME
  done

\$NODETOOL clearsnapshot monasca
EOF
root # chmod +x ~/cassandra-backup-extserver.sh

Execute the following steps on all controller nodes.

Note
Note

The ~/cassandra-backup-extserver.sh script should be executed on all three controller nodes at the same time (within seconds of each other) for a successful backup.

  1. Edit the ~/cassandra-backup-extserver.sh script

    Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and desired backup server (for example, 192.168.69.132), respectively.

    BACKUP_USER=backupuser
    BACKUP_SERVER=192.168.69.132
    BACKUP_DIR=/mnt/backups/cassandra_backups/
  2. Execute ~/cassandra-backup-extserver.sh on all controller nodes that are also Cassandra nodes.

    root # ~/cassandra-backup-extserver.sh
    
    Requested creating snapshot(s) for [monasca] with snapshot name [cassandra-snp-2018-06-28-0251] and options {skipFlush=false}
    Snapshot directory: cassandra-snp-2018-06-28-0251
    sending incremental file list
    created directory /mnt/backups/cassandra_backups//doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
    /var/
    /var/cassandra/
    /var/cassandra/data/
    /var/cassandra/data/data/
    /var/cassandra/data/data/monasca/
    
    ...
    ...
    ...
    
    /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-Summary.db
    /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-TOC.txt
    /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/schema.cql
    sent 173,691 bytes  received 531 bytes  116,148.00 bytes/sec
    total size is 171,378  speedup is 0.98
    Requested clearing snapshot(s) for [monasca]
  3. Verify the Cassandra backup directory on the backup server.

    backupuser > ls -alt /mnt/backups/cassandra_backups
    total 16
    drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
    drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
    drwxr-xr-x 3 backupuser users 4096 Jun 28 02:51 doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
    drwxr-xr-x 8 backupuser users 4096 Jun 27 20:56 ..
    
    backupuser > du -shx /mnt/backups/cassandra_backups/*
    6.2G    /mnt/backups/cassandra_backups/doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
    6.3G    /mnt/backups/cassandra_backups/doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
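
Because the backup script should start on all controller nodes within seconds of each other, it can help to launch it from one place instead of typing the command in three terminals. The following is a minimal sketch, assuming that the user running it can SSH to every controller without a password and may use sudo there, and that the script was saved as /root/cassandra-backup-extserver.sh. The function name run_on_controllers and the REMOTE_SH variable are hypothetical.

```shell
#!/bin/bash
# Sketch only: run_on_controllers and REMOTE_SH are hypothetical names.

run_on_controllers() {
  # $1: command to run; remaining arguments: controller host names
  local cmd="$1"
  local host pid
  local pids=""
  local rc=0
  shift
  for host in "$@"; do
    # Start all remote commands in the background so they run concurrently.
    ${REMOTE_SH:-ssh} "$host" "$cmd" < /dev/null &
    pids="$pids $!"
  done
  # Wait for every host and remember whether any of them failed.
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return $rc
}
```

For example: run_on_controllers 'sudo /root/cassandra-backup-extserver.sh' doc-cp1-c1-m1-mgmt doc-cp1-c1-m2-mgmt doc-cp1-c1-m3-mgmt. A non-zero return value indicates that the command failed on at least one host.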

17.4.5.2 Execute backup of SUSE OpenStack Cloud

  1. Back up the Cloud Lifecycle Manager using the procedure at Section 17.3.1, “Cloud Lifecycle Manager Data Backup”

  2. Back up the MariaDB database using the procedure at Section 17.3.2, “MariaDB Database Backup”

  3. Back up swift rings using the procedure at Section 17.3.3, “swift Ring Backup”

17.4.5.2.1 Restore the first controller
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.

  3. Extract the TAR archives for each of the seven locations.

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory RESTORE_TARGET \
    -f BACKUP_TARGET.tar.gz

    For example, with a directory such as BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz

    With a file such as BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
17.4.5.2.2 Re-deployment of controllers 1, 2 and 3
  1. Change back to the default ardana user.

  2. Run the cobbler-deploy.yml playbook.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  3. Run the bm-reimage.yml playbook limited to the second and third controllers.

    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3

    Replace controller2 and controller3 with the cobbler names of the second and third controllers. Use the bm-power-status.yml playbook to check the cobbler names of these nodes.

  4. Run the site.yml playbook limited to the three controllers and localhost. In this example, these are doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, doc-cp1-c1-m3-mgmt, and localhost.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit \
    doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
17.4.5.2.3 Restore Databases
17.4.5.2.3.1 Restore MariaDB database
  1. Log in to the first controller node.

  2. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

    ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
    --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
    -f mydb.tar.gz
  4. Verify that the files have been restored on the controller.

    ardana > sudo du -shx /tmp/mysql_restore/*
    16K     /tmp/mysql_restore/aria_log.00000001
    4.0K    /tmp/mysql_restore/aria_log_control
    3.4M    /tmp/mysql_restore/barbican
    8.0K    /tmp/mysql_restore/ceilometer
    4.2M    /tmp/mysql_restore/cinder
    2.9M    /tmp/mysql_restore/designate
    129M    /tmp/mysql_restore/galera.cache
    2.1M    /tmp/mysql_restore/glance
    4.0K    /tmp/mysql_restore/grastate.dat
    4.0K    /tmp/mysql_restore/gvwstate.dat
    2.6M    /tmp/mysql_restore/heat
    752K    /tmp/mysql_restore/horizon
    4.0K    /tmp/mysql_restore/ib_buffer_pool
    76M     /tmp/mysql_restore/ibdata1
    128M    /tmp/mysql_restore/ib_logfile0
    128M    /tmp/mysql_restore/ib_logfile1
    12M     /tmp/mysql_restore/ibtmp1
    16K     /tmp/mysql_restore/innobackup.backup.log
    313M    /tmp/mysql_restore/keystone
    716K    /tmp/mysql_restore/magnum
    12M     /tmp/mysql_restore/mon
    8.3M    /tmp/mysql_restore/monasca_transform
    0       /tmp/mysql_restore/multi-master.info
    11M     /tmp/mysql_restore/mysql
    4.0K    /tmp/mysql_restore/mysql_upgrade_info
    14M     /tmp/mysql_restore/nova
    4.4M    /tmp/mysql_restore/nova_api
    14M     /tmp/mysql_restore/nova_cell0
    3.6M    /tmp/mysql_restore/octavia
    208K    /tmp/mysql_restore/opsconsole
    38M     /tmp/mysql_restore/ovs_neutron
    8.0K    /tmp/mysql_restore/performance_schema
    24K     /tmp/mysql_restore/tc.log
    4.0K    /tmp/mysql_restore/test
    8.0K    /tmp/mysql_restore/winchester
    4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  5. Stop SUSE OpenStack Cloud services on the three controllers (using the hostnames of the controllers in your configuration).

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \
    doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  6. Delete the files in the mysql directory and copy the restored backup to that directory.

    root # cd /var/lib/mysql/
    root # rm -rf ./*
    root # cp -pr /tmp/mysql_restore/* ./
  7. Switch back to the ardana user when the copy is finished.
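
Because step 6 deletes the contents of /var/lib/mysql, it is worth confirming first that the extracted backup contains the expected entries. The following is a minimal sketch that checks for three entries taken from the listing in step 4 (ibdata1, grastate.dat, and the mysql directory). The function name restore_looks_complete is hypothetical, and the check does not prove that the backup is consistent.

```shell
#!/bin/bash
# Sketch only: restore_looks_complete is a hypothetical helper.

restore_looks_complete() {
  # $1: directory the MariaDB archive was extracted into
  local dir="$1"
  local entry
  local rc=0
  for entry in ibdata1 grastate.dat mysql; do
    if [ ! -e "$dir/$entry" ]; then
      echo "missing: $dir/$entry" >&2
      rc=1
    fi
  done
  return $rc
}
```

For example, run restore_looks_complete /tmp/mysql_restore as root before deleting anything, and stop if it reports a missing entry.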

17.4.5.2.3.2 Restore Cassandra database

Create a script called cassandra-restore-extserver.sh on all controller nodes.

root # cat > ~/cassandra-restore-extserver.sh << EOF
#!/bin/sh

# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/

# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool

HOST_NAME=\$(/bin/hostname)_

#Get snapshot name from command line.
if [ -z "\$*"  ]
then
  echo "usage \$0 <snapshot to restore>"
  exit 1
fi
SNAPSHOT_NAME=\$1

# restore
rsync -av -e "ssh -i /root/.ssh/id_rsa_backup" \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME/ /

# set ownership of newly restored files
chown -R cassandra:cassandra \$DATA_DIR/monasca/*

# Get a list of snapshot directories that have files to be restored.
RESTORE_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)

# use RESTORE_LIST to move snapshot files back into place of database.
for d in \$RESTORE_LIST
do
  cd \$d
  mv * ../..
  KEYSPACE=\$(pwd | rev | cut -d '/' -f4 | rev)
  TABLE_NAME=\$(pwd | rev | cut -d '/' -f3 |rev | cut -d '-' -f1)
  \$NODETOOL refresh \$KEYSPACE \$TABLE_NAME
done
cd
# Cleanup snapshot directories
\$NODETOOL clearsnapshot \$KEYSPACE
EOF
root # chmod +x ~/cassandra-restore-extserver.sh

Execute the following steps on all controller nodes.

  1. Edit the ~/cassandra-restore-extserver.sh script.

    Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.

    BACKUP_USER=backupuser
    BACKUP_SERVER=192.168.69.132
    BACKUP_DIR=/mnt/backups/cassandra_backups/
  2. Execute ~/cassandra-restore-extserver.sh SNAPSHOT_NAME

    Find SNAPSHOT_NAME in the listing of /mnt/backups/cassandra_backups on the backup server. All the directories have the format HOST_SNAPSHOT_NAME.

    backupuser > ls -alt /mnt/backups/cassandra_backups
    total 16
    drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
    drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
    root # ~/cassandra-restore-extserver.sh cassandra-snp-2018-06-28-0306
    
    receiving incremental file list
    ./
    var/
    var/cassandra/
    var/cassandra/data/
    var/cassandra/data/data/
    var/cassandra/data/data/monasca/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/manifest.json
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-CompressionInfo.db
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-Data.db
    ...
    ...
    ...
    /usr/bin/nodetool clearsnapshot monasca
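
Each controller restores from its own HOST_SNAPSHOT_NAME directory, so the snapshot name passed to the script can differ from node to node. The following is a minimal sketch that derives the snapshot name from a backup directory name for a given host. The function name snapshot_name_for_host is hypothetical.

```shell
#!/bin/bash
# Sketch only: snapshot_name_for_host is a hypothetical helper.

snapshot_name_for_host() {
  # $1: backup directory name in the form HOST_SNAPSHOT_NAME
  # $2: host name (defaults to the local host name)
  local dir_name="$1"
  local host="${2:-$(hostname)}"
  local prefix="${host}_"
  case "$dir_name" in
    "$prefix"*)
      # Strip the host prefix to obtain the snapshot name.
      echo "${dir_name#"$prefix"}"
      ;;
    *)
      # The directory belongs to a different host.
      return 1
      ;;
  esac
}
```

For example, snapshot_name_for_host doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306 doc-cp1-c1-m2-mgmt prints cassandra-snp-2018-06-28-0306, which is the argument used in the example above.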
17.4.5.2.3.3 Restart SUSE OpenStack Cloud services
  1. Restart the MariaDB database

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

    On the deployer node, execute the galera-bootstrap.yml playbook, which determines the log sequence number, bootstraps the main node, and starts the database cluster.

    If this process fails to recover the database cluster, refer to Section 15.2.3.1.2, “Recovering the MariaDB Database”.

  2. Restart SUSE OpenStack Cloud services on the three controllers as in the following example.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
    --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  3. Reconfigure SUSE OpenStack Cloud

    ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
17.4.5.2.4 Post restore testing
  1. Source the service credential file

    ardana > source ~/service.osrc
  2. swift

    ardana > openstack container list
    container_1
    volumebackups
    
    ardana > openstack object list container_1
    var/lib/ardana/backup.osrc
    var/lib/ardana/service.osrc
    
    ardana > openstack object save container_1 /tmp/backup.osrc
  3. neutron

    ardana > openstack network list
    +--------------------------------------+---------------------+--------------------------------------+
    | ID                                   | Name                | Subnets                              |
    +--------------------------------------+---------------------+--------------------------------------+
    | 07c35d11-13f9-41d4-8289-fa92147b1d44 | test_net            | 02d5ca3b-1133-4a74-a9ab-1f1dc2853ec8 |
    +--------------------------------------+---------------------+--------------------------------------+
  4. glance

    ardana > openstack image list
    +--------------------------------------+----------------------+--------+
    | ID                                   | Name                 | Status |
    +--------------------------------------+----------------------+--------+
    | 411a0363-7f4b-4bbc-889c-b9614e2da52e | cirros-0.4.0-x86_64  | active |
    +--------------------------------------+----------------------+--------+
    ardana > openstack image save --file /tmp/cirros f751c39b-f1e3-4f02-8332-3886826889ba
    ardana > ls -lah /tmp/cirros
    -rw-r--r-- 1 ardana ardana 12716032 Jul  2 20:52 /tmp/cirros
  5. nova

    ardana > openstack server list
    
    ardana > openstack server create server_6 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e  --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    +-------------------------------------+------------------------------------------------------------+
    | Field                               | Value                                                      |
    +-------------------------------------+------------------------------------------------------------+
    | OS-DCF:diskConfig                   | MANUAL                                                     |
    | OS-EXT-AZ:availability_zone         |                                                            |
    | OS-EXT-SRV-ATTR:host                | None                                                       |
    | OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                       |
    | OS-EXT-SRV-ATTR:instance_name       |                                                            |
    | OS-EXT-STS:power_state              | NOSTATE                                                    |
    | OS-EXT-STS:task_state               | scheduling                                                 |
    | OS-EXT-STS:vm_state                 | building                                                   |
    | OS-SRV-USG:launched_at              | None                                                       |
    | OS-SRV-USG:terminated_at            | None                                                       |
    | accessIPv4                          |                                                            |
    | accessIPv6                          |                                                            |
    | addresses                           |                                                            |
    | adminPass                           | iJBoBaj53oUd                                               |
    | config_drive                        |                                                            |
    | created                             | 2018-07-02T21:02:01Z                                       |
    | flavor                              | m1.small (2)                                               |
    | hostId                              |                                                            |
    | id                                  | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c                       |
    | image                               | cirros-0.4.0-x86_64 (f751c39b-f1e3-4f02-8332-3886826889ba) |
    | key_name                            | None                                                       |
    | name                                | server_6                                                   |
    | progress                            | 0                                                          |
    | project_id                          | cca416004124432592b2949a5c5d9949                           |
    | properties                          |                                                            |
    | security_groups                     | name='default'                                             |
    | status                              | BUILD                                                      |
    | updated                             | 2018-07-02T21:02:01Z                                       |
    | user_id                             | 8cb1168776d24390b44c3aaa0720b532                           |
    | volumes_attached                    |                                                            |
    +-------------------------------------+------------------------------------------------------------+
    
    ardana > openstack server list
    +--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+
    | ID                                   | Name     | Status | Networks                        | Image               | Flavor    |
    +--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+
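    | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c | server_6 | ACTIVE | n1=1.1.1.8                      | cirros-0.4.0-x86_64 | m1.small  |
    +--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+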
    
    ardana > openstack server delete ce7689ff-23bf-4fe9-b2a9-922d4aa9412c

18 Troubleshooting Issues

Troubleshooting and support processes for solving issues in your environment.

This section contains troubleshooting tasks for your SUSE OpenStack Cloud cloud.

18.1 General Troubleshooting

General troubleshooting procedures for resolving your cloud issues including steps for resolving service alarms and support contact information.

Before contacting support to help you with a problem on SUSE OpenStack Cloud, we recommend gathering as much information as possible about your system and the problem. For this purpose, SUSE OpenStack Cloud ships with a tool called supportconfig. It gathers system information such as the current kernel version being used, the hardware, RPM database, partitions, and other items. supportconfig also collects the most important log files. This information helps support staff identify and solve your problem.

Always run supportconfig on the Cloud Lifecycle Manager and on the Control Node(s). If a Compute Node or a Storage Node is part of the problem, run supportconfig on the affected node as well. For details on how to run supportconfig, see https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#cha-adm-support.

18.1.1 Alarm Resolution Procedures

SUSE OpenStack Cloud provides a monitoring solution based on OpenStack’s monasca service. This service provides monitoring and metrics for all OpenStack components, as well as much of the underlying system. By default, SUSE OpenStack Cloud comes with a set of alarms that provide coverage of the primary systems. In addition, you can define alarms based on threshold values for any metrics defined in the system. You can view alarm information in the Operations Console. You can also receive or deliver this information to others by configuring email or other mechanisms. Alarms provide information about whether a component failed and is affecting the system, and also what condition triggered the alarm.

Here is a list of the included service-specific alarms and the recommended troubleshooting steps. These alarms are organized by the section of the SUSE OpenStack Cloud Operations Console in which they appear, as well as by the service dimension defined.

18.1.1.1 Compute Alarms

These alarms show under the Compute section of the SUSE OpenStack Cloud Operations Console.

18.1.1.1.1 SERVICE: COMPUTE
Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: This is a nova-api health check.

Likely cause: Process crashed.

Restart the nova-api process on the affected node. Review the nova-api.log files. Try to connect locally to the http port that is found in the dimension field of the alarm to see if the connection is accepted.

Name: Host Status

Description: Alarms when the specified host is down or not reachable.

Likely cause: The host is down, has been rebooted, or has network connectivity issues.

If it is a single host, attempt to restart the system. If it is multiple hosts, investigate networking issues.

Name: Process Bound Check

Description: process_name=nova-api. This alarm checks that the number of processes found is within a predefined range.

Likely cause: Process crashed or too many processes running.

Stop all the processes and restart the nova-api process on the affected host. Review the system and nova-api logs.

Name: Process Check

Description: Separate alarms for each of these nova services, specified by the component dimension:

  • nova-api

  • nova-cert

  • nova-compute

  • nova-conductor

  • nova-scheduler

  • nova-novncproxy

Likely cause: Process specified by the component dimension has crashed on the host specified by the hostname dimension.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the nova start playbook against the affected node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-start.yml \
    --limit <hostname>

Review the associated logs. The logs will be in the format of <service>.log, such as nova-compute.log or nova-scheduler.log.

Name: nova.heartbeat

Description: Check that all services are sending heartbeats.

Likely cause: The process for the service specified in the alarm has crashed or is hung and is not reporting its status to the database. Alternatively, the service may be fine, but an issue with messaging or the database prevents the status from being updated correctly.

Restart the affected service. If the service is reporting OK, the issue may be with RabbitMQ or MySQL. In that case, check the alarms for those services.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service may be set to DEBUG instead of INFO level, a repeating error message may be filling up the log files, or log rotation may not be configured properly, so that old log files are not being deleted.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
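
To see quickly which log files contain DEBUG entries, the entries can be counted per file. The following is a minimal sketch; the function name count_debug_entries is hypothetical, and it assumes that the service writes files ending in .log and that DEBUG entries contain the literal string DEBUG.

```shell
#!/bin/bash
# Sketch only: count_debug_entries is a hypothetical helper.

count_debug_entries() {
  # $1: log directory
  # Prints "<count> <file>" for each *.log file, highest count first.
  local dir="$1"
  local f
  for f in "$dir"/*.log; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "$(grep -c 'DEBUG' "$f")" "$f"
  done | sort -rn
}
```

For example, count_debug_entries /var/log/nova lists the nova log files by number of DEBUG lines, assuming the nova logs are stored in that directory.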
18.1.1.1.2 SERVICE: IMAGE-SERVICE in Compute section
Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: Separate alarms for each of these glance services, specified by the component dimension:

  • glance-api

  • glance-registry

Likely cause: API is unresponsive.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the glance start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts glance-start.yml \
    --limit <hostname>

Review the associated logs.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service may be set to DEBUG instead of INFO level, a repeating error message may be filling up the log files, or log rotation may not be configured properly, so that old log files are not being deleted.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.1.3 SERVICE: BAREMETAL in Compute section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running: process_name = ironic-api

Likely cause: The ironic API is unresponsive.

Restart the ironic-api process with these steps:

  1. Log in to the affected host via SSH.

  2. Restart the ironic-api process with this command:

    sudo service ironic-api restart

Name: Process Check

Description: Alarms when the specified process is not running: process_name = ironic-conductor

Likely cause: The ironic-conductor process has crashed.

Restart the ironic-conductor process with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source your admin user credentials:

    source ~/service.osrc
  3. Locate the messaging_deployer VM:

    openstack server list --all-tenants | grep mess
  4. SSH to the messaging_deployer VM:

    sudo -u ardana ssh <IP_ADDRESS>
  5. Stop the ironic-conductor process by using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ironic-stop.yml
  6. Start the process back up again, effectively restarting it, by using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ironic-start.yml

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

Likely cause: The API is unresponsive.

  1. Log in to the Cloud Lifecycle Manager.

  2. Source your admin user credentials:

    source ~/service.osrc
  3. Locate the messaging_deployer VM:

    openstack server list --all-tenants | grep mess
  4. SSH to the messaging_deployer VM:

    sudo -u ardana ssh <IP_ADDRESS>
  5. Stop the ironic-api process by using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ironic-stop.yml
  6. Start the process back up again, effectively restarting it, by using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ironic-start.yml

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
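
The checks described above can be sketched as two small shell helpers. This is a minimal illustration, not part of the product tooling: the function names are hypothetical, and the " DEBUG " pattern assumes that the log level appears as a separate word in each log line.

```shell
# Sketch: find the largest log files and count DEBUG entries.
# The function names and the " DEBUG " pattern are assumptions.

# largest_logs DIR [COUNT]: print the COUNT largest files (size in KB, path).
largest_logs() {
    find "$1" -type f -exec du -k {} + | sort -rn | head -n "${2:-10}"
}

# count_debug DIR: print how many lines in *.log files contain " DEBUG ".
count_debug() {
    find "$1" -type f -name '*.log' -exec cat {} + | grep -c ' DEBUG '
}
```

For example, to inspect a service log directory such as /var/log/neutron:

    largest_logs /var/log/neutron 5
    count_debug /var/log/neutron

A high DEBUG count suggests that the logging level should be set back to INFO. Large rotated files suggest a log rotate problem.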

18.1.1.2 Storage Alarms

These alarms show under the Storage section of the SUSE OpenStack Cloud Operations Console.

18.1.1.2.1 SERVICE: OBJECT-STORAGE
Alarm Information | Mitigation Tasks

Name: swiftlm-scan monitor

Description: Alarms if swiftlm-scan cannot execute a monitoring task.

Likely cause: The swiftlm-scan program is used to monitor and measure a number of metrics. If it is unable to monitor or measure something, it raises this alarm.

Click the alarm to examine the Details field and look for a msg field. The text may explain the problem. To confirm this, you can also log in to the host specified by the hostname dimension and run this command:

sudo swiftlm-scan | python -mjson.tool

The msg field is contained in the value_meta item.
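
To list only the msg entries from the pretty-printed output, a simple filter such as the following may help. It is a sketch: the helper name show_msgs is hypothetical, and it assumes the messages contain no escaped double quotes.

```shell
# show_msgs: read pretty-printed JSON on stdin and print every "msg" entry.
# Hypothetical helper; it matches text only and does not parse JSON.
show_msgs() {
    grep -o '"msg": *"[^"]*"'
}
```

It can be appended to the command shown above:

    sudo swiftlm-scan | python -mjson.tool | show_msgs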

Name: swift account replicator last completed in 12 hours

Description: Alarms if an account-replicator process did not complete a replication cycle within the last 12 hours.

Likely cause: This can indicate that the account-replication process is stuck.

SSH to the affected host and restart the process with this command:

sudo systemctl restart swift-account-replicator

Another cause of this problem may be a corrupt file system. Look for signs of this in these logs on the affected node:

/var/log/swift/swift.log
/var/log/kern.log
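
One way to scan these logs is a case-insensitive search for typical file system and disk error messages. The helper below is a sketch: its name is hypothetical and the pattern list is illustrative rather than exhaustive.

```shell
# fs_error_lines FILE...: print lines that look like file system or disk errors.
# The patterns are examples only; adjust them to the messages you see.
fs_error_lines() {
    grep -iE 'xfs.*(error|corrupt)|i/o error|structure needs cleaning' "$@"
}
```

For example:

    sudo sh -c 'grep -iE "xfs.*(error|corrupt)|i/o error" /var/log/kern.log'

or, after defining the helper in a root shell, fs_error_lines /var/log/swift/swift.log /var/log/kern.log.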

The file system may need to be wiped. Contact Sales Engineering for advice on the best way to do that if needed. You can then reformat the file system with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift deploy playbook against the affected node, which will format the wiped file system:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
    --limit <hostname>

Name: swift container replicator last completed in 12 hours

Description: Alarms if a container-replicator process did not complete a replication cycle within the last 12 hours.

Likely cause: This can indicate that the container-replication process is stuck.

SSH to the affected host and restart the process with this command:

sudo systemctl restart swift-container-replicator

Another cause of this problem may be a corrupt file system. Look for signs of this in these logs on the affected node:

/var/log/swift/swift.log
/var/log/kern.log

The file system may need to be wiped. Contact Sales Engineering for advice on the best way to do that if needed. You can then reformat the file system with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift deploy playbook against the affected node, which will format the wiped file system:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
    --limit <hostname>

Name: swift object replicator last completed in 24 hours

Description: Alarms if an object-replicator process did not complete a replication cycle within the last 24 hours.

Likely cause: This can indicate that the object-replication process is stuck.

SSH to the affected host and restart the process with this command:

sudo systemctl restart swift-object-replicator

Another cause of this problem may be a corrupt file system. Look for signs of this in these logs on the affected node:

/var/log/swift/swift.log
/var/log/kern.log

The file system may need to be wiped. Contact Sales Engineering for advice on the best way to do that if needed. You can then reformat the file system with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift deploy playbook against the affected node, which will format the wiped file system:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-deploy.yml \
    --limit <hostname>

Name: swift configuration file ownership

Description: Alarms if files/directories in /etc/swift are not owned by swift.

Likely cause: For files in /etc/swift, somebody may have manually edited or created a file.

For files in /etc/swift, use this command to change the file ownership:

ardana > sudo chown swift.swift /etc/swift/ /etc/swift/*

Name: swift data filesystem ownership

Description: Alarms if files or directories in /srv/node are not owned by swift.

Likely cause: For directories in /srv/node/*, it may happen that the root partition was reimaged or reinstalled and the UID assigned to the swift user changed. The directories and files would then not be owned by the UID assigned to the swift user.

For directories and files in /srv/node/*, compare the swift UID of this system and other systems and the UID of the owner of /srv/node/*. If possible, make the UID of the swift user match the directories or files. Otherwise, change the ownership of all files and directories under the /srv/node path using a similar chown swift.swift command as above.
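
The UID comparison can be sketched with find, which can select entries by numeric owner. The helper name not_owned_by is hypothetical. It lists everything under the given paths that is not owned by the given UID.

```shell
# not_owned_by UID PATH...: list files and directories whose owner is not UID.
# Hypothetical helper built on the standard "find ! -uid" test.
not_owned_by() {
    local uid="$1"
    shift
    find "$@" ! -uid "$uid"
}
```

For example, to list entries under /etc/swift that are not owned by the swift user on this system:

    not_owned_by "$(id -u swift)" /etc/swift

Empty output means that all entries already belong to that UID. Reading /srv/node may require root privileges.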

Name: Drive URE errors detected

Description: Alarms if swift-drive-audit reports an unrecoverable read error on a drive used by the swift service.

Likely cause: An unrecoverable read error occurred when swift attempted to access a directory.

The UREs reported only apply to file system metadata (that is, directory structures). For UREs in object files, the swift system automatically deletes the file and replicates a fresh copy from one of the other replicas.

UREs are a normal feature of large disk drives and do not mean that the drive has failed. However, if you get regular UREs on a specific drive, this may indicate that the drive has indeed failed and should be replaced.

You can use standard XFS repair actions to correct the UREs in the file system.

If the XFS repair fails, you should wipe the GPT table as follows (where <drive_name> is replaced by the actual drive name):

ardana > sudo dd if=/dev/zero of=/dev/sd<drive_name> \
bs=$((1024*1024)) count=1

Then follow the steps below which will reformat the drive, remount it, and restart swift services on the affected node.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift reconfigure playbook, specifying the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts _swift-configure.yml \
    --limit <hostname>

It is safe to reformat drives containing swift data because swift maintains other copies of the data (usually, swift is configured to have three replicas of all data).

Name: swift service

Description: Alarms if a swift process, specified by the component field, is not running.

Likely cause: A daemon specified by the component dimension on the host specified by the hostname dimension has stopped running.

Examine the /var/log/swift/swift.log file for possible error messages related to the swift process. The process in question is listed in the component dimension of the alarm.

Restart swift processes by running the swift-start.yml playbook, with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift start playbook against the affected host:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-start.yml \
    --limit <hostname>

Name: swift filesystem mount point status

Description: Alarms if a file system/drive used by swift is not correctly mounted.

Likely cause: The device specified by the device dimension is not correctly mounted at the mountpoint specified by the mount dimension.

The most probable cause is that the drive has failed or that it had a temporary failure during the boot process and remained unmounted.

Another possible cause is file system corruption that prevents the device from being mounted.

Reboot the node and see if the file system remains unmounted.

If the file system is corrupt, see the process described for the "Drive URE errors detected" alarm to wipe and reformat the drive.

Name: swift uptime-monitor status

Description: Alarms if the swiftlm-uptime-monitor has errors using keystone (keystone-get-token), swift (rest-api) or swift's healthcheck.

Likely cause: The swiftlm-uptime-monitor cannot get a token from keystone or cannot get a successful response from the swift Object-Storage API.

Check that the keystone service is running:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check the status of the keystone service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts keystone-status.yml
  3. If it is not running, start the service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts keystone-start.yml
  4. Contact the support team if further assistance troubleshooting the keystone service is needed.

Check that swift is running:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check the status of the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml
  3. If it is not running, start the service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-start.yml

Restart the swiftlm-uptime-monitor as follows:

  1. Log in to the first server running the swift-proxy-server service. Use the playbook below to determine which host this is:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
    --limit SWF-PRX[0]
  2. Restart the swiftlm-uptime-monitor with this command:

    ardana > sudo systemctl restart swiftlm-uptime-monitor

Name: swift keystone server connect

Description: Alarms if a socket cannot be opened to the keystone service (used for token validation).

Likely cause: The Identity service (keystone) server may be down. Another possible cause is a network problem between the host reporting the problem and the keystone server, or the haproxy process is not forwarding requests to keystone.

The URL dimension contains the name of the virtual IP address. Use cURL or a similar program to confirm that a connection can or cannot be made to the virtual IP address. Check that haproxy is running. Check that the keystone service is working.
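
A connection test with cURL could look like the sketch below. The helper names are hypothetical. Certificate verification and proxy settings are deliberately bypassed because only reachability is being tested, and any HTTP response, including an error status, counts as a successful connection.

```shell
# can_connect URL: succeed if a connection to URL can be made within 5 seconds.
# --insecure and --noproxy are used because this only probes reachability.
can_connect() {
    curl --silent --output /dev/null --max-time 5 --insecure --noproxy '*' "$1"
}

# report_connect URL: print a one-line verdict for URL.
report_connect() {
    if can_connect "$1"; then
        echo "connected: $1"
    else
        echo "cannot connect: $1"
    fi
}
```

Run it from the host that reported the alarm, using the value of the URL dimension:

    report_connect <URL>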

Name: swift service listening on ip and port

Description: Alarms when a swift service is not listening on the correct port or IP address.

Likely cause: The swift service may be down.

Verify the status of the swift service on the affected host, as specified by the hostname dimension.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift status playbook to confirm status:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
    --limit <hostname>

If an issue is determined, you can stop and restart the swift service with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the swift service on the affected host:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml \
    --limit <hostname>
  3. Restart the swift service on the affected host:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-start.yml \
    --limit <hostname>

Name: swift rings checksum

Description: Alarms if the swift rings checksums do not match on all hosts.

Likely cause: The swift ring files must be the same on every node. The files are located in /etc/swift/*.ring.gz.

If you have just changed any of the swift rings and the change is still being deployed, it is normal for this alarm to trigger. It will likely clear on its own once the deployment completes. If it does not, continue with these steps.

Use sudo swift-recon --md5 to find which node has outdated rings.
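
If you want to compare ring files on individual nodes by hand, you can also compute the checksums directly. This is a sketch with a hypothetical helper name. It assumes the md5sum utility is available on the node.

```shell
# ring_checksums DIR: print "checksum  filename" for each ring file in DIR.
# Hypothetical helper; file names are printed without the directory part
# so that output from different nodes can be compared line by line.
ring_checksums() {
    ( cd "$1" && md5sum -- *.ring.gz )
}
```

Run the following on two nodes and compare the output. Any line that differs identifies a ring file that is out of date on one of them:

    ring_checksums /etc/swift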

Run the swift-reconfigure.yml playbook, using the steps below. This deploys the same set of rings to every node.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml

Name: swift memcached server connect

Description: Alarms if a socket cannot be opened to the specified memcached server.

Likely cause: The server may be down. The memcached daemon running on the server may have stopped.

If the server is down, restart it.

If memcached has stopped, you can restart it by using the memcached-start.yml playbook, using the steps below. If this fails, rebooting the node will restart the process.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the memcached start playbook against the affected host:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts memcached-start.yml \
    --limit <hostname>

If the server is running and memcached is running, there may be a network problem blocking port 11211.
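
To test whether the port is reachable from the host that reported the alarm, a TCP connection attempt is usually enough. The sketch below uses the /dev/tcp feature of bash together with the timeout utility. The helper name port_open is hypothetical.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened within 3 seconds.
# Relies on bash's /dev/tcp redirection, so bash is invoked explicitly.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example:

    port_open <memcached_host> 11211 && echo open || echo blocked

If the port is open from the memcached server itself but blocked from other hosts, look at firewall rules and network configuration between them.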

If you see sporadic alarms on different servers, the system may be running out of resources. Contact Sales Engineering for advice.

Name: swift individual disk usage exceeds 80%

Description: Alarms when a disk drive used by swift exceeds 80% utilization.

Likely cause: Generally all disk drives will fill roughly at the same rate. If an individual disk drive becomes filled faster than other drives it can indicate a problem with the replication process.

If many or most of your disk drives are 80% full, you need to add more nodes to your system or delete existing objects.

If one disk drive is noticeably (more than 30%) more utilized than the average of other disk drives, check that swift processes are working on the server (use the steps below) and also look for alarms related to the host. Otherwise continue to monitor the situation.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift status playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml
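
The comparison against the average of the other drives can be sketched with awk. The helper name is hypothetical, and "more than 30%" is interpreted here as 30 percentage points, which is an assumption. Input is one line per drive with a name and a usage percentage.

```shell
# flag_outlier_drives [LIMIT]: read "name percent" lines on stdin and print
# each drive whose usage exceeds the average of the other drives by more
# than LIMIT percentage points (default 30).
flag_outlier_drives() {
    awk -v limit="${1:-30}" '
        { name[NR] = $1; used[NR] = $2; total += $2 }
        END {
            if (NR < 2) exit
            for (i = 1; i <= NR; i++) {
                others = (total - used[i]) / (NR - 1)
                if (used[i] - others > limit)
                    printf "%s %d%% (others average %.1f%%)\n", name[i], used[i], others
            }
        }'
}
```

On a swift node, assuming the drives are mounted under /srv/node and GNU df is available, the input can be produced like this:

    df --output=target,pcent /srv/node/* | tail -n +2 | tr -d '%' | flag_outlier_drives 30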

Name: swift individual disk usage exceeds 90%

Description: Alarms when a disk drive used by swift exceeds 90% utilization.

Likely cause: Generally all disk drives will fill roughly at the same rate. If an individual disk drive becomes filled faster than other drives it can indicate a problem with the replication process.

If one disk drive is noticeably (more than 30%) more utilized than the average of other disk drives, check that swift processes are working on the server, using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift status playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml

Also look for alarms related to the host. An individual disk drive filling can indicate a problem with the replication process.

Restart swift on that host using the --limit argument to target the host:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml \
    --limit <hostname>
  3. Start the swift service back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-start.yml \
    --limit <hostname>

If the utilization does not return to similar values as other disk drives, you can reformat the disk drive. You should only do this if the average utilization of all disk drives is less than 80%. To format a disk drive contact Sales Engineering for instructions.

Name: swift total disk usage exceeds 80%

Description: Alarms when the average disk utilization of swift disk drives exceeds 80% utilization.

Likely cause: The number and size of objects in your system are beginning to fill the available disk space. Account and container storage is included in disk utilization. However, this generally consumes 1-2% of space compared to objects, so object storage is the dominant consumer of disk space.

You need to add more nodes to your system or delete existing objects to remain under 80% utilization.

If you delete a project/account, the objects in that account are not removed until a week later by the account-reaper process, so this is not a good way of quickly freeing up space.

Name: swift total disk usage exceeds 90%

Description: Alarms when the average disk utilization of swift disk drives exceeds 90% utilization.

Likely cause: The number and size of objects in your system are beginning to fill the available disk space. Account and container storage is included in disk utilization. However, this generally consumes 1-2% of space compared to objects, so object storage is the dominant consumer of disk space.

If your disk drives are 90% full, you must immediately stop all applications that put new objects into the system. At that point you can either delete objects or add more servers.

Using the steps below, set the fallocate_reserve value to a value higher than the currently available space on disk drives. This will prevent more objects being created.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the configuration files below and change the value for fallocate_reserve to a value higher than the currently available space on the disk drives:

    ~/openstack/my_cloud/config/swift/account-server.conf.j2
    ~/openstack/my_cloud/config/swift/container-server.conf.j2
    ~/openstack/my_cloud/config/swift/object-server.conf.j2
  3. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "changing swift fallocate_reserve value"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the swift reconfigure playbook to deploy the change:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
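
To choose a value for step 2, you need the largest amount of free space currently available on any swift drive. The sketch below picks the largest number from a list. The helper name is hypothetical, and it assumes that fallocate_reserve is expressed in bytes.

```shell
# largest_free_bytes: read one free-space figure per line on stdin, ignore
# non-numeric lines such as a column header, and print the largest value.
largest_free_bytes() {
    grep -E '^[[:space:]]*[0-9]+[[:space:]]*$' | sort -n | tail -n 1 | tr -d ' '
}
```

On a swift node, assuming the drives are mounted under /srv/node and GNU df is available:

    df -B1 --output=avail /srv/node/* | largest_free_bytes

Repeat this on each swift node and set fallocate_reserve to a value above the largest result.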

If you allow your file systems to become full, you will be unable to delete objects or add more nodes to the system. This is because the system needs some free space to handle the replication process when adding nodes. With no free space, the replication process cannot work.

Name: swift service per-minute availability

Description: Alarms if the swift service reports unavailable for the previous minute.

Likely cause: The swiftlm-uptime-monitor service runs on the first proxy server. It monitors the swift endpoint and reports latency data. If the endpoint stops reporting, it generates this alarm.

There are many reasons why the endpoint may stop running. Check:

  • Is haproxy running on the control nodes?

  • Is swift-proxy-server running on the swift proxy servers?

Name: swift rsync connect

Description: Alarms if a socket cannot be opened to the specified rsync server.

Likely cause: The rsync daemon on the specified node cannot be contacted. The most probable cause is that the node is down. The rsync service might also have been stopped on the node.

Reboot the server if it is down.

Attempt to restart rsync with this command:

sudo systemctl restart rsync.service

Name: swift smart array controller status

Description: Alarms if there is a failure in the Smart Array.

Likely cause: The Smart Array or Smart HBA controller has a fault, a component of the controller (such as a battery) has failed, or caching is disabled.

The HPE Smart Storage Administrator (HPE SSA) CLI component will have to be installed for SSACLI status to be reported. HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

Log in to the reported host and run these commands to find out the status of the controllers:

sudo hpssacli
=> controller show all detail

For hardware failures (such as failed battery), replace the failed component. If the cache is disabled, reenable the cache.

Name: swift physical drive status

Description: Alarms if there is a failure in the Physical Drive.

Likely cause: A disk drive on the server has failed or has warnings.

Log in to the reported host and run these commands to find out the status of the drive:

sudo hpssacli
=> ctrl slot=1 pd all show

Replace any broken drives.

Name: swift logical drive status

Description: Alarms if there is a failure in the Logical Drive.

Likely cause: A LUN on the server is degraded or has failed.

Log in to the reported host and run these commands to find out the status of the LUN:

sudo hpssacli
=> ctrl slot=1 ld all show
=> ctrl slot=1 pd all show

Replace any broken drives.

Name: Process Check

Description: Alarms when the specified process is not running.

Likely cause: If the service dimension is object-storage, see the description of the "swift service" alarm for possible causes.

If the service dimension is object-storage, see the description of the "swift service" alarm for possible mitigation tasks.

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

Likely cause: If the service dimension is object-storage, see the description of the "swift host socket connect" alarm for possible causes.

If the service dimension is object-storage, see the description of the "swift host socket connect" alarm for possible mitigation tasks.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

18.1.1.2.2 SERVICE: BLOCK-STORAGE in Storage section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Separate alarms for each of these cinder services, specified by the component dimension:

  • cinder-api

  • cinder-backup

  • cinder-scheduler

  • cinder-volume

Likely cause: Process crashed.

Restart the process on the affected node. Review the associated logs.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the cinder-start.yml playbook to start the process back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts cinder-start.yml \
    --limit <hostname>
    Note

    The --limit <hostname> switch is optional. If it is included, then the <hostname> you should use is the host where the alarm was raised.

Name: Process Check

Description: Alarms when the specified process is not running: process_name=cinder-backup

Likely cause: Process crashed.

Alert may be incorrect if the service has migrated. Validate that the service is intended to be running on this node before restarting the service. Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running:

process_name=cinder-scheduler

Likely cause: Process crashed.

Restart the process on the affected node. Review the associated logs.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the cinder-start.yml playbook to start the process back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts cinder-start.yml \
    --limit <hostname>
    Note

    The --limit <hostname> switch is optional. If it is included, then the <hostname> you should use is the host where the alarm was raised.

Name: Process Check

Description: Alarms when the specified process is not running: process_name=cinder-volume

Likely cause: Process crashed.

Alert may be incorrect if the service has migrated. Validate that the service is intended to be running on this node before restarting the service. Review the associated logs.

Name: cinder backup running <hostname> check

Description: cinder backup singleton check.

Likely cause: The cinder-backup process is either:

  • running on a node it should not be on, or

  • not running on a node it should be on

Run the cinder-migrate-volume.yml playbook to migrate the volume and backup services to the correct node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run this playbook to migrate the service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts cinder-migrate-volume.yml

Name: cinder volume running <hostname> check

Description: cinder volume singleton check.

Likely cause: The cinder-volume process is either:

  • running on a node it should not be on, or

  • not running on a node it should be on

Run the cinder-migrate-volume.yml playbook to migrate the volume and backup services to the correct node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run this playbook to migrate the service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts cinder-migrate-volume.yml

Name: Storage faulty lun check

Description: Alarms if local LUNs on your HPE servers using smartarray are not OK.

Likely cause: A LUN on the server is degraded or has failed.

Log in to the reported host and run these commands to find out the status of the LUN:

sudo hpssacli
=> ctrl slot=1 ld all show
=> ctrl slot=1 pd all show

Replace any broken drives.

Name: Storage faulty drive check

Description: Alarms if the local disk drives on your HPE servers using smartarray are not OK.

Likely cause: A disk drive on the server has failed or has warnings.

Log in to the reported host and run these commands to find out the status of the drive:

sudo hpssacli
=> ctrl slot=1 pd all show

Replace any broken drives.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

18.1.1.3 Networking Alarms

These alarms show under the Networking section of the SUSE OpenStack Cloud Operations Console.

18.1.1.3.1 SERVICE: NETWORKING
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running. Separate alarms for each of these neutron services, specified by the component dimension:

  • ipsec/charon

  • neutron-openvswitch-agent

  • neutron-l3-agent

  • neutron-dhcp-agent

  • neutron-metadata-agent

  • neutron-server

  • neutron-vpn-agent

Likely cause: Process crashed.

Restart the process on the affected node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check the status of the networking service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-status.yml
  3. Make note of the failed service names and the affected hosts which you will use to review the logs later.

  4. Using the affected hostname(s) from the previous output, run the neutron start playbook to restart the services:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-start.yml \
    --limit <hostname>
    Note

    You can pass multiple hostnames with the --limit option by separating them with a colon (:).

  5. Check the status of the networking service again:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-status.yml
  6. Once all services are back up, you can SSH to the affected host(s) and review the logs in the location below for any errors around the time that the alarm triggered:

    /var/log/neutron/<service_name>
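
When several hosts are affected, the colon-separated value for --limit mentioned in step 4 can be built from a list of host names. The helper below is a small sketch and its name is hypothetical.

```shell
# limit_pattern HOST...: join the host names with ":" for use with --limit.
limit_pattern() {
    local IFS=:
    printf '%s\n' "$*"
}
```

For example:

    ardana > ansible-playbook -i hosts/verb_hosts neutron-start.yml \
    --limit "$(limit_pattern <hostname1> <hostname2>)"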

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = neutron-rootwrap

Likely cause: Process crashed.

Currently neutron-rootwrap is only used to run ovsdb-client. To restart this process, use these steps:

  1. SSH to the affected host(s).

  2. Restart the process:

    sudo systemctl restart neutron-openvswitch-agent
  3. Review the logs at the location below for errors:

    /var/log/neutron/neutron-openvswitch-agent.log

Name: HTTP Status

Description: neutron api health check

Likely cause: Process is stuck if the neutron-server Process Check is not OK.

  1. SSH to the affected host(s).

  2. Run this command to restart the neutron-server process:

    sudo systemctl restart neutron-server
  3. Review the logs at the location below for errors:

    /var/log/neutron/neutron-server.log

Name: HTTP Status

Description: neutron api health check

Likely cause: The node crashed. Alternatively, only connectivity might have been lost if the local node HTTP Status is OK or UNKNOWN.

Reboot the node if it crashed or diagnose the networking connectivity failures between the local and remote nodes. Review the logs.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
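The first two checks above (which directory is largest, and whether DEBUG entries are present) can be sketched as follows. Both function names are hypothetical, and /var/log in the usage comments is assumed to be the parent of the service log directories.

```shell
#!/bin/sh
# largest_dirs: hypothetical helper. Lists the entries under directory $1
# by disk usage in KiB, largest first, showing at most $2 entries
# (default 5).
largest_dirs() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}

# count_debug: hypothetical helper. Counts the lines in file $1 that
# contain DEBUG surrounded by spaces.
count_debug() {
    grep -c ' DEBUG ' "$1"
}

# Usage (may require sudo to read all directories):
#   largest_dirs /var/log 5
#   count_debug /var/log/neutron/neutron-server.log
```

A high DEBUG count in the largest directory suggests checking the logging level first. A low count points toward a repeating error or a log rotation problem.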
18.1.1.3.2 SERVICE: DNS in Networking section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = designate-zone-manager

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the designate start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-start.yml \
    --limit 'DES-ZMG'

Review the log located at:

/var/log/designate/designate-zone-manager.log

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = designate-pool-manager

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the designate start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-start.yml \
    --limit 'DES-PMG'

Review the log located at:

/var/log/designate/designate-pool-manager.log

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = designate-central

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the designate start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-start.yml \
    --limit 'DES-CEN'

Review the log located at:

/var/log/designate/designate-central.log

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = designate-api

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the designate start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-start.yml \
    --limit 'DES-API'

Review the log located at:

/var/log/designate/designate-api.log

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = designate-mdns

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the designate start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-start.yml \
    --limit 'DES-MDN'

Review the log located at:

/var/log/designate/designate-mdns.log

Name: HTTP Status

Description: component = designate-api. This alarm will also have the api_endpoint and monitored_host_types dimensions defined. The likely cause and mitigation steps are the same for both.

Likely cause: The API is unresponsive.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the designate start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-start.yml \
    --limit 'DES-API,DES-CEN'

Review the logs located at:

/var/log/designate/designate-api.log
/var/log/designate/designate-central.log

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.3.3 SERVICE: BIND in Networking section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running: process_name = pdns_server

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the PowerDNS start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts bind-start.yml

Review the following log, filtering for process = pdns_server:

/var/log/syslog

Name: Process Check

Description: Alarms when the specified process is not running: process_name = named

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the BIND start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts bind-start.yml

Review the following log, filtering for process = named:

/var/log/syslog

18.1.1.4 Identity Alarms

These alarms show under the Identity section of the SUSE OpenStack Cloud Operations Console.

18.1.1.4.1 SERVICE: IDENTITY-SERVICE
Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: This check is contacting the keystone public endpoint directly.

component=keystone-api
api_endpoint=public

Likely cause: The keystone service is down on the affected node.

Restart the keystone service on the affected node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the keystone start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-start.yml \
    --limit <hostname>

Name: HTTP Status

Description: This check is contacting the keystone admin endpoint directly.

component=keystone-api
api_endpoint=admin

Likely cause: The keystone service is down on the affected node.

Restart the keystone service on the affected node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the keystone start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-start.yml \
    --limit <hostname>

Name: HTTP Status

Description: This check is contacting the keystone admin endpoint via the virtual IP address (HAProxy).

component=keystone-api
monitored_host_type=vip

Likely cause: The keystone service is unreachable via the virtual IP address.

If neither the api_endpoint=public nor the api_endpoint=admin alarm is triggering at the same time, there is likely a problem with haproxy.

You can restart the haproxy service with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use this playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts FND-CLU-start.yml \
    --limit <hostname>
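The reasoning for the virtual IP alarm can be expressed as a small decision helper. This is an illustrative sketch: vip_alarm_suspect is a hypothetical name, and the state string ALARM is assumed to be the state reported when the api_endpoint=public or api_endpoint=admin alarm is triggering.

```shell
#!/bin/sh
# vip_alarm_suspect: hypothetical helper for the monitored_host_type=vip
# alarm.
# $1: state of the api_endpoint=public alarm
# $2: state of the api_endpoint=admin alarm
# Prints the component to investigate first.
vip_alarm_suspect() {
    if [ "$1" != "ALARM" ] && [ "$2" != "ALARM" ]; then
        # Neither direct endpoint check is failing, so the proxy layer
        # in front of keystone is the more likely culprit.
        echo "haproxy"
    else
        echo "keystone"
    fi
}

# Usage:
#   vip_alarm_suspect OK OK
```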

Name: Process Check

Description: Separate alarms for each of these keystone services, specified by the component dimension:

  • keystone-main

  • keystone-admin

Likely cause: Process crashed.

You can restart the keystone service with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use this playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-start.yml \
    --limit <hostname>

Review the logs in /var/log/keystone on the affected node.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

18.1.1.5 Telemetry Alarms

These alarms show under the Telemetry section of the SUSE OpenStack Cloud Operations Console.

18.1.1.5.1 SERVICE: TELEMETRY
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the ceilometer-agent-notification process is not running.

Likely cause: Process has crashed.

Review the logs on the alarming host in the following location for the cause:

/var/log/ceilometer/ceilometer-agent-notification-json.log

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the ceilometer start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-start.yml \
    --limit <hostname>

Name: Process Check

Description: Alarms when the ceilometer-polling process is not running.

Likely cause: Process has crashed.

Review the logs on the alarming host in the following location for the cause:

/var/log/ceilometer/ceilometer-polling-json.log

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the ceilometer start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-start.yml \
    --limit <hostname>
18.1.1.5.2 SERVICE: METERING in Telemetry section
Alarm Information | Mitigation Tasks

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.5.3 SERVICE: KAFKA in Telemetry section
Alarm Information | Mitigation Tasks

Name: Kafka Persister Metric Consumer Lag

Description: Alarms when the Persister consumer group is not keeping up with the incoming messages on the metric topic.

Likely cause: There is a slowdown in the system or a heavy load.

Verify that all of the monasca-persister services are up with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Verify that all of the monasca-persister services are up with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-persister

Look for high load in the various systems. This alert can fire for multiple topics or on multiple hosts. Determining which alarms are firing can help diagnose likely causes. For example:

  • If all alarms are on the same machine, the machine could be at fault.

  • If one topic is alarming across multiple machines, the consumers of that topic are likely at fault.

Name: Kafka Alarm Transition Consumer Lag

Description: Alarms when the specified consumer group is not keeping up with the incoming messages on the alarm state transition topic.

Likely cause: There is a slowdown in the system or a heavy load.

Check that monasca-thresh and monasca-notification are up.

Look for high load in the various systems. This alert can fire for multiple topics or on multiple hosts. Determining which alarms are firing can help diagnose likely causes. For example:

  • If all alarms are on the same machine, the machine could be at fault.

  • If one topic is shared across multiple machines, the consumers of that topic are likely at fault.
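The two rules of thumb above can be sketched as a small classifier. This is an illustration only: classify_lag_alarms is a hypothetical name, and the input format (one "hostname topic" pair per firing alarm, read from standard input) is an assumption about how the alarm list has been collected.

```shell
#!/bin/sh
# classify_lag_alarms: hypothetical helper. Reads one "hostname topic"
# pair per firing consumer lag alarm from stdin and prints a first guess.
classify_lag_alarms() {
    awk '
        { hosts[$1] = 1; topics[$2] = 1 }
        END {
            nh = 0; for (h in hosts) nh++
            nt = 0; for (t in topics) nt++
            if (nh == 0)      print "no alarms"
            else if (nh == 1) print "suspect host"       # all alarms on one machine
            else if (nt == 1) print "suspect consumers"  # one topic, several machines
            else              print "inconclusive"
        }'
}

# Usage (placeholder hostnames and topic names):
#   printf '%s\n' "host1 metrics" "host2 metrics" | classify_lag_alarms
```

A single alarm is reported as "suspect host", which is only a starting point. High load on the systems involved should still be checked.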

Name: Kafka Kronos Consumer Lag

Description: Alarms when the Kronos consumer group is not keeping up with the incoming messages on the metric topic.

Likely cause: There is a slowdown in the system or a heavy load.

Look for high load in the various systems. This alert can fire for multiple topics or on multiple hosts. Determining which alarms are firing can help diagnose likely causes. For example:

  • If all alarms are on the same machine, the machine could be at fault.

  • If one topic is shared across multiple machines, the consumers of that topic are likely at fault.

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = kafka.Kafka

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the kafka service with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml \
    --tags kafka
  3. Start the kafka service back up with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags kafka

Review the logs in /var/log/kafka/server.log.

18.1.1.5.4 SERVICE: LOGGING in Telemetry section
Alarm Information | Mitigation Tasks

Name: Beaver Memory Usage

Description: Beaver is using more memory than expected. This may indicate that it cannot forward messages and its queue is filling up. If you continue to see this, see the troubleshooting guide.

Likely cause: Overloaded system or services with memory leaks.

Log on to the reporting host to investigate high memory users.

Name: Audit Log Partition Low Watermark

Description: The /var/audit disk space usage has crossed the low watermark. If the high watermark is reached, logrotate will be run to free up disk space. If needed, adjust:

var_audit_low_watermark_percent

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

Name: Audit Log Partition High Watermark

Description: The /var/audit volume is running low on disk space. Logrotate will be run now to free up space. If needed, adjust:

var_audit_high_watermark_percent

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

Name: Elasticsearch Unassigned Shards

Description: component = elasticsearch; Elasticsearch unassigned shards count is greater than 0.

Likely cause: Environment could be misconfigured.

To find the unassigned shards, run the following command on the Cloud Lifecycle Manager from the ~/scratch/ansible/next/ardana/ansible directory:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible -i hosts/verb_hosts LOG-SVR[0] -m shell -a \
"curl localhost:9200/_cat/shards?pretty -s" | grep UNASSIGNED

This shows which shards are unassigned, like this:

logstash-2015.10.21 4 p UNASSIGNED ... 10.240.75.10 NodeName

The last column shows the name that Elasticsearch uses for the node that the unassigned shards are on. To find the actual host name, run:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible -i hosts/verb_hosts LOG-SVR[0] -m shell -a \
"curl localhost:9200/_nodes/_all/name?pretty -s"

When you find the host name, take the following steps:

  1. Make sure the node is not out of disk space, and free up space if needed.

  2. Restart the node (use caution, as this may affect other services as well).

  3. Make sure all versions of Elasticsearch are the same:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible -i hosts/verb_hosts LOG-SVR -m shell -a \
    "curl localhost:9200/_nodes/_local/name?pretty -s" | grep version
  4. Contact customer support.
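The grep in the first command of this entry can be replaced by a column-aware filter that prints only the index, shard number, and primary/replica flag of each unassigned shard. This is a sketch: list_unassigned is a hypothetical name, and the column order (index, shard, type, state) is taken from the sample output shown in this entry.

```shell
#!/bin/sh
# list_unassigned: hypothetical helper. Reads _cat/shards output on stdin
# and prints "index shard type" for every shard whose state column is
# UNASSIGNED. Lines with a different fourth column are ignored.
list_unassigned() {
    awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Usage, from ~/scratch/ansible/next/ardana/ansible:
#   ansible -i hosts/verb_hosts LOG-SVR[0] -m shell -a \
#       "curl localhost:9200/_cat/shards?pretty -s" | list_unassigned
```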

Name: Elasticsearch Number of Log Entries

Description: Elasticsearch Number of Log Entries: component = elasticsearch;

Likely cause: The number of log entries may get too large.

Older versions of Kibana (version 3 and earlier) may hang if the number of log entries is too large (for example, above 40,000). The page size must also be small enough (about 20,000 results), because a larger page size (for example, 200,000) may hang the browser. Kibana 4 should not have this issue.

Name: Elasticsearch Field Data Evictions

Description: Elasticsearch Field Data Evictions count is greater than 0: component = elasticsearch

Likely cause: Field data evictions may occur even though the cache is nowhere near the configured limit.

The elasticsearch_indices_fielddata_cache_size is set to unbounded by default. If this is set by the user to a value that is insufficient, you may need to increase this configuration parameter or set it to unbounded and run a reconfigure using the steps below:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the configuration file below and change the value for elasticsearch_indices_fielddata_cache_size to your desired value:

    ~/openstack/my_cloud/config/logging/main.yml
  3. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "Elasticsearch fielddata cache size"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the Logging reconfigure playbook to deploy the change:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

Name: Process Check

Description: Separate alarms for each of these logging services, specified by the process_name dimension:

  • elasticsearch

  • logstash

  • beaver

  • apache2

  • kibana

Likely cause: Process has crashed.

On the affected node, attempt to restart the process.

If the elasticsearch process has crashed, use:

ardana > sudo systemctl restart elasticsearch

If the logstash process has crashed, use:

ardana > sudo systemctl restart logstash

The rest of the processes can be restarted using similar commands, listed here:

ardana > sudo systemctl restart beaver
ardana > sudo systemctl restart apache2
ardana > sudo systemctl restart kibana
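The restart commands above can be wrapped in a loop that restarts only the services that are not currently active. This is a minimal sketch: restart_if_inactive is a hypothetical helper name, and it assumes that each process name listed above is also the name of its systemd unit, as the commands above suggest.

```shell
#!/bin/sh
# restart_if_inactive: hypothetical helper. For each unit name given as
# an argument, restarts the unit only if systemd does not report it as
# active.
restart_if_inactive() {
    for svc in "$@"; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "restarting $svc"
            sudo systemctl restart "$svc"
        fi
    done
}

# Usage on the affected node:
#   restart_if_inactive elasticsearch logstash beaver apache2 kibana
```

Skipping services that are already active avoids interrupting healthy parts of the logging pipeline.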
18.1.1.5.5 SERVICE: MONASCA-TRANSFORM in Telemetry section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: process_name = pyspark

Likely cause: Service process has crashed.

Restart the process on the affected node. Review the logs.

This process is a child of spark-worker, but it is created only once the monasca-transform process begins processing streams. If the process fails on one node only, along with the pyspark process, it is likely that the spark-worker has failed to connect to the elected leader of the spark-master service. In this case, start the spark-worker service on the affected node. If the process fails on multiple nodes, check the spark-worker, spark-master, and monasca-transform services and logs. If the monasca-transform or spark services have been interrupted, this process may not reappear for up to ten minutes (the stream processing interval).

Name: Process Check

Description: process_name = org.apache.spark.executor.CoarseGrainedExecutorBackend

Likely cause: Service process has crashed.

Restart the process on the affected node. Review the logs.

This process is a child of spark-worker, but it is created only once the monasca-transform process begins processing streams. If the process fails on one node only, along with the pyspark process, it is likely that the spark-worker has failed to connect to the elected leader of the spark-master service. In this case, start the spark-worker service on the affected node. If the process fails on multiple nodes, check the spark-worker, spark-master, and monasca-transform services and logs. If the monasca-transform or spark services have been interrupted, this process may not reappear for up to ten minutes (the stream processing interval).

Name: Process Check

Description: process_name = monasca-transform

Likely cause: Service process has crashed.

Restart the service on the affected node. Review the logs.
18.1.1.5.6 SERVICE: MONITORING in Telemetry section
Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: Persister Health Check component = monasca-persister

Likely cause: The process has crashed or a dependency is down.

If the process has crashed, restart it using the steps below. If a dependent service is down, address that issue.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-persister is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-persister
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags persister
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-persister

Review the associated logs.

Name: HTTP Status

Description: API Health Check component = monasca-api

Likely cause: The process has crashed or a dependency is down.

If the process has crashed, restart it using the steps below. If a dependent service is down, address that issue.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-api is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-api
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags monasca-api
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-api

Review the associated logs.

Name: monasca Agent Collection Time

Description: Alarms when the elapsed time the monasca-agent takes to collect metrics is high.

Likely cause: Heavy load on the box or a stuck agent plug-in.

Address the load issue on the machine. If needed, restart the agent.

Restart the agent on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-agent is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --limit <hostname>
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml

Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running: component = kafka

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if Kafka is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags kafka
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags kafka
  4. Verify that Kafka is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags kafka

Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = monasca-notification

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-notification is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags notification
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags notification
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags notification

Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running: process_name = monasca-agent

Likely cause: Process crashed.

Restart the agent on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-agent is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --limit <hostname>
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml

Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running: process_name = monasca-api

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-api is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-api
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags monasca-api
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-api

Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running: process_name = monasca-persister

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-persister is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-persister
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags persister
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags monasca-persister

Review the associated logs.

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = backtype.storm.daemon.nimbus
component = apache-storm

Likely cause: Process crashed.

Review the logs in the /var/log/storm directory on all storm hosts to find the root cause.

Note

The logs containing threshold engine logging are on the 2nd and 3rd controller nodes.

Restart monasca-thresh, if necessary, with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-thresh is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags thresh
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags thresh
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags thresh

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = backtype.storm.daemon.supervisor
component = apache-storm

Likely cause: Process crashed.

Review the logs in the /var/log/storm directory on all storm hosts to find the root cause.

Note

The logs containing threshold engine logging are on the 2nd and 3rd controller nodes.

Restart monasca-thresh with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the monasca-thresh service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml \
    --tags thresh
  3. Start the monasca-thresh service back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags thresh
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags thresh

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = backtype.storm.daemon.worker
component = apache-storm

Likely cause: Process crashed.

Review the logs in the /var/log/storm directory on all storm hosts to find the root cause.

Note

The logs containing threshold engine logging are on the 2nd and 3rd controller nodes.

Restart monasca-thresh with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the monasca-thresh service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml \
    --tags thresh
  3. Start the monasca-thresh service back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags thresh
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags thresh

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = monasca-thresh
component = apache-storm

Likely cause: Process crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Check if monasca-thresh is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags thresh
  3. Use the monasca start playbook against the affected node to restart it:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml \
    --tags thresh
  4. Verify that it is running on all nodes with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-status.yml \
    --tags thresh

Review the associated logs.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service log directory, as indicated by the path dimension, is over the 2.5 GB quota.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

18.1.1.6 Console Alarms

These alarms show under the Console section of the SUSE OpenStack Cloud Operations Console.

Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: service=ops-console

Likely cause: The Operations Console is unresponsive

Review logs in /var/log/ops-console and logs in /var/log/apache2. Restart ops-console by running the following commands on the Cloud Lifecycle Manager:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ops-console-stop.yml
ardana > ansible-playbook -i hosts/verb_hosts ops-console-start.yml

Name: Process Check

Description: Alarms when the specified process is not running: process_name=leia-leia_monitor

Likely cause: Process crashed or unresponsive.

Review logs in /var/log/ops-console. Restart ops-console by running the following commands on the Cloud Lifecycle Manager:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ops-console-stop.yml
ardana > ansible-playbook -i hosts/verb_hosts ops-console-start.yml

18.1.1.7 System Alarms

These alarms show under the System section and are set up per hostname and/or mount_point.

18.1.1.7.1 SERVICE: SYSTEM
Alarm Information | Mitigation Tasks

Name: CPU Usage

Description: Alarms on high CPU usage.

Likely cause: Heavy load or runaway processes.

Log in to the reporting host and diagnose the heavy CPU usage.

Name: Elasticsearch Low Watermark

Description: component = elasticsearch Elasticsearch disk low watermark. Backup indices. If high watermark is reached, indices will be deleted. Adjust curator_low_watermark_percent, curator_high_watermark_percent, and elasticsearch_max_total_indices_size_in_bytes if needed.

Likely cause: Running out of disk space for /var/lib/elasticsearch.

Free up space by removing indices (backing them up first if desired). Alternatively, adjust curator_low_watermark_percent, curator_high_watermark_percent, and/or elasticsearch_max_total_indices_size_in_bytes if needed.

For more information about how to back up your centralized logs, see Section 13.2.5, “Configuring Centralized Logging”.

Name: Elasticsearch High Watermark

Description: component = elasticsearch Elasticsearch disk high watermark. Attempting to delete indices to free disk space. Adjust curator_low_watermark_percent, curator_high_watermark_percent, and elasticsearch_max_total_indices_size_in_bytes if needed.

Likely cause: Running out of disk space for /var/lib/elasticsearch

Verify that disk space was freed up by the curator. If needed, free up additional space by removing indices (backing them up first if desired). Alternatively, adjust curator_low_watermark_percent, curator_high_watermark_percent, and/or elasticsearch_max_total_indices_size_in_bytes if needed.

For more information about how to back up your centralized logs, see Section 13.2.5, “Configuring Centralized Logging”.

Name: Log Partition Low Watermark

Description: The /var/log disk space usage has crossed the low watermark. If the high watermark is reached, logrotate will be run to free up disk space. Adjust var_log_low_watermark_percent if needed.

Likely cause: A service may be set to DEBUG instead of INFO level, a repeating error message may be filling up the log files, or log rotate may not be configured properly, so old log files are not being deleted.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.

Name: Log Partition High Watermark

Description: The /var/log volume is running low on disk space. Logrotate will be run now to free up space. Adjust var_log_high_watermark_percent if needed.

Likely cause: A service may be set to DEBUG instead of INFO level, a repeating error message may be filling up the log files, or log rotate may not be configured properly, so old log files are not being deleted.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
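
As a starting point for finding the service that is filling the log partition, the following sketch ranks the entries directly under a log directory by size. The function name largest_log_dirs is hypothetical and not part of SUSE OpenStack Cloud; the sketch assumes that services log into their own subdirectories under /var/log and that the standard du, sort, and head tools are available.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SUSE OpenStack Cloud):
# print the N largest entries directly under a log directory, sizes in KB.
largest_log_dirs() {
    log_root="${1:-/var/log}"   # directory to inspect
    count="${2:-5}"             # number of entries to print
    # du -sk prints "<size-in-KB><TAB><path>"; sort numerically, largest first.
    du -sk "$log_root"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

Calling largest_log_dirs /var/log 5 on the affected host (typically as root, so that every directory is readable) prints the five largest entries, which narrows down which service logs to review first.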

Name: Crash Dump Count

Description: Alarms if it receives any metrics with crash.dump_count > 0

Likely cause: When a crash dump is generated by kdump, the crash dump file is put into the /var/crash directory by default. Any crash dump files in this directory will cause the crash.dump_count metric to show a value greater than 0.

Analyze the crash dump file(s) located in /var/crash on the host that generated the alarm to try to determine if a service or hardware caused the crash.

Move the file to a new location so that a developer can take a look at it. Make sure all of the processes are back up after the crash (run the <service>-status.yml playbooks). When the /var/crash directory is empty, the Crash Dump Count alarm should transition back to OK.

Name: Disk Inode Usage

Description: Nearly out of inodes for a partition, as indicated by the mount_point reported.

Likely cause: Many files on the disk.

Investigate cleanup of data or migration to other partitions.
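
To see which partitions are close to running out of inodes, a sketch like the following can filter the output of df. The function name inode_hotspots and the default threshold of 90 percent are assumptions made for illustration; they are not taken from the alarm definition. The sketch also assumes mount points without spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `df -iP` output on stdin and print the mount
# points whose inode usage is at or above a threshold (assumed default: 90).
inode_hotspots() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 && $5 ~ /%$/ {        # skip the header; column 5 is IUse%
            pct = $5
            sub(/%$/, "", pct)       # strip the percent sign
            if (pct + 0 >= limit) print $6, $5
        }'
}
```

On the reporting host, this would be used as df -iP | inode_hotspots 90. File systems that report "-" instead of a percentage are skipped.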

Name: Disk Usage

Description: High disk usage, as indicated by the mount_point reported.

Likely cause: Large files on the disk.

Investigate cleanup of data or migration to other partitions.

Name: Host Status

Description: Alerts when a host is unreachable. test_type = ping

Likely cause: Host or network is down.

If a single host is unreachable, attempt to restart the system. If multiple hosts are unreachable, investigate network issues.

Name: Memory Usage

Description: High memory usage.

Likely cause: Overloaded system or services with memory leaks.

Log in to the reporting host to investigate high memory users.

Name: Network Errors

Description: Alarms on a high network error rate.

Likely cause: Bad network or cabling.

Take this host out of service until the network can be fixed.

Name: NTP Time Sync

Description: Alarms when the NTP time offset is high.

Log in to the reported host and check if the ntp service is running.

If it is running, then use these steps:

  1. Stop the service:

    service ntpd stop
  2. Resynchronize the node's time:

    /usr/sbin/ntpdate -b <ntp-server>
  3. Restart the ntp service:

    service ntpd start
  4. Restart rsyslog:

    service rsyslog restart

18.1.1.8 Other Services Alarms

These alarms show under the Other Services section of the SUSE OpenStack Cloud Operations Console.

18.1.1.8.1 SERVICE: APACHE
Alarm Information | Mitigation Tasks

Name: Apache Status

Description: Alarms on failure to reach the Apache status endpoint.

 

Name: Process Check

Description: Alarms when the specified process is not running: process_name = apache2

If the Apache process goes down, connect to the affected node via SSH and restart it with this command: sudo systemctl restart apache2

Name: Apache Idle Worker Count

Description: Alarms when there are no idle workers in the Apache server.

 
18.1.1.8.2 SERVICE: BACKUP in Other Services section
Alarm Information | Mitigation Tasks

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service log directory, as indicated by the path dimension, is over the 2.5 GB quota.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.8.3 SERVICE: HAPROXY in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running: process_name = haproxy

Likely cause: HA Proxy is not running on this machine.

Restart the process on the affected node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run this playbook on the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts FND-CLU-start.yml \
    --limit <hostname>

Review the associated logs.

18.1.1.8.4 SERVICE: ARDANA-UX-SERVICES in Other Services section
Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

 
18.1.1.8.5 SERVICE: KEY-MANAGER in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = barbican-api

Likely cause: Process has crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the barbican start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts barbican-start.yml \
    --limit <hostname>

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

component = barbican-api
api_endpoint = public or internal

Likely cause: The endpoint is not responsive; it may be down.

For the HTTP Status alarms for the public and internal endpoints, restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the barbican service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts barbican-stop.yml \
    --limit <hostname>
  3. Start the barbican service back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts barbican-start.yml \
    --limit <hostname>

Examine the logs in /var/log/barbican/ for possible error messages.

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

component = barbican-api
monitored_host_type = vip

Likely cause: The barbican API on the admin virtual IP is down.

This alarm verifies access to the barbican API via the virtual IP address (HAProxy). If this check is failing but the other two HTTP Status alarms for the key-manager service are not, the issue is likely with HAProxy, so review the alarms for that service. If the other two HTTP Status alarms are alerting as well, restart barbican using the steps listed for those alarms.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service log directory, as indicated by the path dimension, is over the 2.5 GB quota.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.8.6 SERVICE: MYSQL in Other Services section
Alarm Information | Mitigation Tasks

Name: MySQL Slow Query Rate

Description: Alarms when the slow query rate is high.

Likely cause: The system load is too high.

This could be an indication of near capacity limits or an exposed bad query. First, check overall system load and then investigate MySQL details.

Name: Process Check

Description: Alarms when the specified process is not running.

Likely cause: MySQL crashed.

Restart MySQL on the affected node.
18.1.1.8.7 SERVICE: OCTAVIA in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running. There are individual alarms for each of these processes:

  • octavia-worker

  • octavia-housekeeping

  • octavia-api

  • octavia-health-manager

Likely cause: The process has crashed.

Restart the process on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the Octavia start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts octavia-start.yml \
    --limit <hostname>

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

Likely cause: The octavia-api process could be down, or there could be an issue with haproxy or the network.

If the octavia-api process is down, restart it on the affected node using these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Use the Octavia start playbook against the affected node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts octavia-start.yml \
    --limit <hostname>

If the octavia-api process is not the issue, check whether there is an issue with haproxy or the network and troubleshoot accordingly.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service log directory, as indicated by the path dimension, is over the 2.5 GB quota.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.8.8 SERVICE: ORCHESTRATION in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running. There are individual alarms for each of these processes:

  • heat-api

  • heat-api-cfn

  • heat-api-cloudwatch

  • heat-engine

heat-api process check on each node

Likely cause: Process crashed.

Restart the process with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop all the heat processes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-stop.yml
  3. Start the heat processes back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-start.yml

Review the relevant log at the following locations on the affected node:

/var/log/heat/heat-api.log
/var/log/heat/heat-cfn.log
/var/log/heat/heat-cloudwatch.log
/var/log/heat/heat-engine.log

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

  • heat-api

  • heat-api-cfn

  • heat-api-cloudwatch

Restart the heat service with these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop all the heat processes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-stop.yml
  3. Start the heat processes back up:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-start.yml

Review the relevant log at the following locations on the affected node:

/var/log/heat/heat-api.log
/var/log/heat/heat-cfn.log
/var/log/heat/heat-cloudwatch.log

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service log directory, as indicated by the path dimension, is over the 2.5 GB quota.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.8.9 SERVICE: OVSVAPP-SERVICEVM in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = ovs-vswitchd
process_name = neutron-ovsvapp-agent
process_name = ovsdb-server

Likely cause: Process has crashed.

Restart process on affected node. Review logs.
18.1.1.8.10 SERVICE: RABBITMQ in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = rabbitmq
process_name = epmd

Likely cause: Process has crashed.

Restart process on affected node. Review logs.
18.1.1.8.11 SERVICE: SPARK in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running:

process_name = org.apache.spark.deploy.master.Master
process_name = org.apache.spark.deploy.worker.Worker

Likely cause: Process has crashed.

Restart process on affected node. Review logs.
18.1.1.8.12 SERVICE: WEB-UI in Other Services section
Alarm Information | Mitigation Tasks

Name: HTTP Status

Description: Alarms when the specified HTTP endpoint is down or not reachable.

Likely cause: Apache is not running or there is a misconfiguration.

Check that Apache is running; investigate horizon logs.

Name: Service Log Directory Size

Description: Service log directory consuming more disk than its quota.

Likely cause: The service log directory, as indicated by the path dimension, is over the 2.5 GB quota.

Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed.
18.1.1.8.13 SERVICE: ZOOKEEPER in Other Services section
Alarm Information | Mitigation Tasks

Name: Process Check

Description: Alarms when the specified process is not running: process_name = org.apache.zookeeper.server

Likely cause: Process crashed.

Restart the process on the affected node. Review the associated logs.

Name: ZooKeeper Latency

Description: Alarms when the ZooKeeper latency is high.

Likely cause: Heavy system load.

Check the individual system as well as activity across the entire service.

18.1.1.9 ESX vCenter Plugin Alarms

These alarms relate to your ESX cluster, if you are utilizing one.

Alarm Information | Mitigation Tasks

Name: ESX cluster CPU Usage

Description: Alarms when the average CPU usage for a particular cluster exceeds 90% continuously for 3 polling cycles.

Alarm will have the following dimension:

esx_cluster_id=<domain>.<vcenter-id>

Likely cause: Virtual machines are consuming more than 90% of allocated vCPUs.

  • Reduce the load on highly consuming virtual machines by restarting/stopping one or more services.

  • Add more vCPUs to the host(s) attached to the cluster.

Name: ESX cluster Disk Usage

Description:

  • Alarms when the total size of all the shared datastores attached to the cluster exceeds 90% of their total allocated capacity.

  • Or, in the case of a cluster having a single host, when the size of the non-shared datastore exceeds 90% of its allocated capacity.

  • Alarm will have the following dimension:

    esx_cluster_id=<domain>.<vcenter-id>

Likely cause:

  • Virtual machines occupying the storage.

  • Large file or image being copied on the datastore(s).

  • Check the virtual machines that are consuming more disk space. Delete unnecessary files.

  • Delete unnecessary files and images from datastore(s).

  • Add storage to the datastore(s).

Name: ESX cluster Memory Usage

Description: Alarms when the average RAM usage for a particular cluster exceeds 90% continuously for 3 polling cycles.

Alarm will have the following dimension:

esx_cluster_id=<domain>.<vcenter-id>

Likely cause: Virtual machines are consuming more than 90% of their total allocated memory.

  • Reduce the load on the highly consuming virtual machines by restarting or stopping one or more services.

  • Add more memory to the host(s) attached to the cluster.
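
The condition "exceeds 90% continuously for 3 polling cycles" used by the CPU and memory alarms above can be illustrated with a small sketch. The function name sustained_over_90 is hypothetical, and the code only models the wording of the alarm description; it is not the alarm definition used by the monitoring service.

```shell
#!/bin/sh
# Hypothetical illustration: succeeds (exit status 0) only when the last
# three samples (percentages, oldest first) are all greater than 90.
sustained_over_90() {
    [ "$#" -ge 3 ] || return 1
    # Keep only the three most recent samples.
    while [ "$#" -gt 3 ]; do shift; done
    for sample in "$@"; do
        # awk compares numerically, so decimal values such as 90.5 work.
        echo "$sample" | awk '{ exit !($1 > 90) }' || return 1
    done
    return 0
}
```

For example, the samples 95 92 91 satisfy the condition, while 95 92 90 do not, because the most recent value does not exceed 90.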

18.1.2 Support Resources

To solve issues in your cloud, consult the Knowledge Base or contact SUSE Support.

18.1.2.1 Using the Knowledge Base

Support information is available at the SUSE Support page https://www.suse.com/products/suse-openstack-cloud/. This page offers access to the Knowledge Base, forums and documentation.

18.1.2.2 Contacting SUSE Support

Information about accessing and using SUSE Technical Support is available at https://www.suse.com/support/handbook/. This page has guidelines and links to many online support services, such as support account management, incident reporting, issue reporting, feature requests, training, and consulting.

18.2 Control Plane Troubleshooting

Troubleshooting procedures for control plane services.

18.2.1 Understanding and Recovering RabbitMQ after Failure

RabbitMQ is the message queue service that runs on each of your controller nodes and brokers communication between multiple services in your SUSE OpenStack Cloud 9 cloud environment. It is important for cloud operators to understand how different troubleshooting scenarios affect RabbitMQ so they can minimize downtime in their environments. We are going to discuss multiple scenarios and how they affect RabbitMQ. We will also explain how you can recover from them if there are issues.

18.2.1.1 How upgrades affect RabbitMQ

There are two types of upgrades within SUSE OpenStack Cloud: major and minor. The effect that the upgrade process has on RabbitMQ depends on the type.

A major upgrade is defined by an Erlang change or a major version upgrade of RabbitMQ. A minor upgrade is an upgrade where RabbitMQ stays within the same version, such as v3.4.3 to v3.4.6.

During both types of upgrades there may be minor blips in the authentication process of client services as the accounts are recreated.

RabbitMQ during a major upgrade

There will be a RabbitMQ service outage while the upgrade is performed.

During the upgrade, high availability consistency is compromised: all but the primary node will go down and will be reset, meaning their database copies are deleted. The primary node is not taken down until the last step, and then it is upgraded. The database of users and permissions is maintained during this process. Then the other nodes are brought back into the cluster and resynchronized.

RabbitMQ during a minor upgrade

Minor upgrades are performed node by node. This "rolling" process means there should be no overall service outage because each node is taken out of its cluster in turn, its database is reset, and then it is added back to the cluster and resynchronized.

18.2.1.2 How RabbitMQ is affected by other operational processes

There are operational tasks, such as Section 15.1.1.1, “Bringing Down Your Cloud: Services Down Method”, where you use the ardana-stop.yml and ardana-start.yml playbooks to gracefully restart your cloud. If you use these playbooks, and there are no errors associated with them forcing you to troubleshoot further, then RabbitMQ is brought down gracefully and brought back up. There is nothing special to note regarding RabbitMQ in these normal operational processes.

However, there are other scenarios where an understanding of RabbitMQ is important when a graceful shutdown did not occur.

The examples that follow assume you are using one of the entry-scale models where RabbitMQ is hosted on your controller node cluster. If you are using a mid-scale model or have a dedicated cluster that RabbitMQ lives on, you may need to alter the steps accordingly. To determine which nodes RabbitMQ is on, you can use the rabbitmq-status.yml playbook from your Cloud Lifecycle Manager.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml

Your entire control plane cluster goes down

If you have a scenario where all of your controller nodes went down, either manually or via another process such as a power outage, then an understanding of how RabbitMQ should be brought back up is important. Follow these steps to recover RabbitMQ on your controller node cluster in these cases:

  1. The order in which the nodes went down is key here. Locate the last node to go down, because it will be used as the primary node when bringing the RabbitMQ cluster back up. Review the timestamps in the log files under /var/log/rabbitmq to determine which node went down last.

    Note

    The primary status of a node is transient; it only applies for the duration that this process is running. There is no long-term distinction between any of the nodes in your cluster. The primary node is simply the one that owns the RabbitMQ configuration database that will be synchronized across the cluster.

  2. Run the ardana-start.yml playbook, specifying the primary node (the last node to go down, as determined in the first step):

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<hostname>
    Note

    The <hostname> value will be the "shortname" for your node, as found in the /etc/hosts file.

If one of your controller nodes goes down

The first step is to determine whether the controller that went down is the primary RabbitMQ host. The primary host is the first member of the FND-RMQ group in the following file on your Cloud Lifecycle Manager:

~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts

In the example below, ardana-cp1-c1-m1-mgmt would be the primary:

[FND-RMQ-ccp-cluster1:children]
ardana-cp1-c1-m1-mgmt
ardana-cp1-c1-m2-mgmt
ardana-cp1-c1-m3-mgmt
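
Reading the first member of the group can also be scripted. The following is a minimal sketch, assuming the inventory layout shown in the example above; the function name rmq_primary_host is hypothetical, and the default group name FND-RMQ-ccp-cluster1 is taken from that example and may differ in your environment.

```shell
#!/bin/sh
# Hypothetical helper: print the first member listed under an inventory
# group. Arguments: inventory file, group name (assumed default below).
rmq_primary_host() {
    inventory="$1"
    group="${2:-FND-RMQ-ccp-cluster1}"
    awk -v group="$group" '
        /^\[/ {                       # a section header such as [name:children]
            if (in_group) exit        # next group reached without a member
            name = $0
            sub(/^\[/, "", name)      # drop the opening bracket
            sub(/:.*$/, "", name)     # drop a ":children" style suffix
            sub(/\].*$/, "", name)    # drop the closing bracket
            in_group = (name == group)
            next
        }
        in_group && NF && $1 !~ /^#/ { print $1; exit }
    ' "$inventory"
}
```

Run on the Cloud Lifecycle Manager as rmq_primary_host ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts, this prints the host that the text above treats as the primary. It prints nothing if the group is not found.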

If your primary RabbitMQ controller node has gone down and you need to bring it back up, you can follow these steps. In this playbook you are using the rabbit_primary_hostname parameter to specify the hostname for one of the other controller nodes in your environment hosting RabbitMQ, which will serve as the primary node in the recovery. You will also use the --limit parameter to specify the controller node you are attempting to bring back up.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_you_are_bringing_up>

If the node you need to bring back is not the primary RabbitMQ node then you can just run the ardana-start.yml playbook with the --limit parameter and your node should recover:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <hostname_of_node_you_are_bringing_up>
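
The choice between the two commands above can be sketched as a small helper that prints, but does not run, the matching command. The function name rmq_recovery_command is hypothetical. Picking the first other group member as the temporary primary is an assumption made for illustration; the text above only requires one of the other controller nodes hosting RabbitMQ.

```shell
#!/bin/sh
# Hypothetical helper: print the ardana-start.yml command for recovering
# one controller. Arguments: the node being brought up, followed by the
# FND-RMQ group members in inventory order (first member = primary).
rmq_recovery_command() {
    down_node="$1"
    primary="$2"
    shift                      # "$@" now holds only the group members
    base="ansible-playbook -i hosts/verb_hosts ardana-start.yml"
    if [ "$down_node" != "$primary" ]; then
        # A non-primary node only needs --limit.
        echo "$base --limit $down_node"
        return 0
    fi
    # The primary is down: name another member as the primary for recovery
    # (assumption: the first other member is running and suitable).
    for member in "$@"; do
        if [ "$member" != "$down_node" ]; then
            echo "$base -e rabbit_primary_hostname=$member --limit $down_node"
            return 0
        fi
    done
    echo "no other RabbitMQ member available" >&2
    return 1
}
```

Review the printed command before running it from ~/scratch/ansible/next/ardana/ansible on the Cloud Lifecycle Manager.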

If you are replacing one or more of your controller nodes

The same general process noted above is used if you are removing or replacing one or more of your controller nodes.

If your node needs minor hardware repairs, but does not need to be replaced with a new node, you should use the ardana-stop.yml playbook with the --limit parameter to stop services on that node prior to removing it from the cluster.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the rabbitmq-stop.yml playbook, specifying the hostname of the node you are removing, which will remove the node from the RabbitMQ cluster:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-stop.yml --limit <hostname_of_node_you_are_removing>
  3. Run the ardana-stop.yml playbook, again specifying the hostname of the node you are removing, which will stop the rest of the services and prepare it to be removed:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <hostname_of_node_you_are_removing>

If your node cannot be repaired and needs to be replaced with another baremetal node, all references to the replaced node must be removed from the RabbitMQ cluster. This is because RabbitMQ associates a cookie with each node in the cluster, and the cookie is derived in part from the specific hardware. Replacing a hard drive in a node is therefore possible, but changing a motherboard or replacing the node entirely may cause RabbitMQ to stop working. When this happens, the RabbitMQ cluster must be edited from a running RabbitMQ node. The following steps show how to do this.

In this example, controller 3 is the node being replaced:

  1. On the Cloud Lifecycle Manager, change to the Ansible directory:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. SSH to a running RabbitMQ cluster node.

    ardana > ssh cloud-cp1-rmq-mysql-m1-mgmt
  3. Force the cluster to forget the node you are removing (in this example, the controller 3 node).

    ardana > sudo rabbitmqctl forget_cluster_node \
    rabbit@cloud-cp1-rmq-mysql-m3-mgmt
  4. Confirm that the node has been removed.

    ardana > sudo rabbitmqctl cluster_status
  5. On the replacement node, information and services related to RabbitMQ must be removed.

    ardana > sudo systemctl stop rabbitmq-server
    ardana > sudo systemctl stop epmd.socket
  6. Verify that the epmd service has stopped (kill it if it is still running).

    ardana > ps -eaf | grep epmd
  7. Remove the Mnesia database directory.

    ardana > sudo rm -rf /var/lib/rabbitmq/mnesia
  8. Restart the RabbitMQ server.

    ardana > sudo systemctl start rabbitmq-server
  9. On the Cloud Lifecycle Manager, run the ardana-start.yml playbook.
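
The confirmation in step 4 can be sketched as a simple text check. The helper name node_in_cluster_status is hypothetical. It makes no assumption about the exact layout of the cluster_status output, which differs between RabbitMQ versions, and only looks for the node name as a whole word.

```shell
# node_in_cluster_status: hypothetical helper. Reads the text printed by
# "rabbitmqctl cluster_status" on stdin and succeeds (exit status 0) if
# the given node name still appears anywhere in it.
# usage: sudo rabbitmqctl cluster_status | node_in_cluster_status <hostname>
node_in_cluster_status() {
    grep -qwF -- "rabbit@$1"
}
```

A non-zero exit status for the removed node (here, the controller 3 node) indicates that its name no longer appears in the output.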

If the node you are removing or replacing is your primary host, specify a new primary host when you add it back to your cluster, as follows:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_you_are_adding>

If the node you are removing/replacing is not your primary host then you can add it as follows:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <hostname_of_node_you_are_adding>

If one of your controller nodes has rebooted or temporarily lost power

After a single reboot, RabbitMQ will not automatically restart. This is by design to protect your RabbitMQ cluster. To restart RabbitMQ, you should follow the process below.

If the rebooted node was your primary RabbitMQ host, you will specify a different primary hostname using one of the other nodes in your cluster:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_that_rebooted>

If the rebooted node was not the primary RabbitMQ host then you can just start it back up with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <hostname_of_node_that_rebooted>

18.2.1.3 Recovering RabbitMQ

In this section we will show you how to check the status of RabbitMQ and how to do a variety of disaster recovery procedures.

Verifying the status of RabbitMQ

You can verify the status of RabbitMQ on each of your controller nodes by using the following steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the rabbitmq-status.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
  3. If all is well, you should see an output similar to the following:

    PLAY RECAP ********************************************************************
    rabbitmq | status | Check RabbitMQ running hosts in cluster ------------- 2.12s
    rabbitmq | status | Check RabbitMQ service running ---------------------- 1.69s
    rabbitmq | status | Report status of RabbitMQ --------------------------- 0.32s
    -------------------------------------------------------------------------------
    Total: ------------------------------------------------------------------ 4.36s
    ardana-cp1-c1-m1-mgmt  : ok=2    changed=0    unreachable=0    failed=0
    ardana-cp1-c1-m2-mgmt  : ok=2    changed=0    unreachable=0    failed=0
    ardana-cp1-c1-m3-mgmt  : ok=2    changed=0    unreachable=0    failed=0
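
If you capture the playbook output, a short filter can list only the hosts that need attention. This is a sketch: the helper name recap_problems is hypothetical, and it assumes uncolored output with recap lines in the format shown above.

```shell
# recap_problems: hypothetical helper. Reads ansible-playbook output on
# stdin and prints each host whose recap line reports a non-zero
# "unreachable" or "failed" count.
recap_problems() {
    awk '/unreachable=[0-9]+/ && /failed=[0-9]+/ {
        bad = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(unreachable|failed)=/ && $i !~ /=0$/) bad = 1
        if (bad) print $1
    }'
}
```

For example, piping the output of the rabbitmq-status.yml run into recap_problems prints nothing when every controller reports unreachable=0 and failed=0.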

If one or more of your controller nodes are having RabbitMQ issues then continue reading, looking for the scenario that best matches yours.

RabbitMQ recovery after a small network outage

In the case of a transient network outage, the version of RabbitMQ included with SUSE OpenStack Cloud 9 is likely to recover automatically without any further action needed. However, if yours does not and the rabbitmq-status.yml playbook is reporting an issue then use the scenarios below to resolve your issues.

All of your controller nodes have gone down and other methods have not brought RabbitMQ back up

If your RabbitMQ cluster is irrecoverable and you need rapid service recovery, either because other methods cannot resolve the issue or because you do not have time to investigate more nuanced approaches, use the disaster recovery playbook. This playbook tears down and resets all RabbitMQ services and then recreates the RabbitMQ cluster, which has an extreme effect on your services.

  1. Log in to your Cloud Lifecycle Manager.

  2. Run the RabbitMQ disaster recovery playbook. This generally takes around two minutes.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml
  3. Run the reconfigure playbooks for both cinder (Block Storage) and heat (Orchestration), if those services are present in your cloud. These services are affected when the fan-out queues are not recovered correctly. The reconfigure generally takes around five minutes.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml
    ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml
    ardana > ansible-playbook -i hosts/verb_hosts kronos-server-configure.yml
  4. If you need to do a safe recovery of all the services in your environment then you can use this playbook. This is a more lengthy process as all services are inspected.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml

One of your controller nodes has gone down and other methods have not brought RabbitMQ back up

This disaster recovery procedure has the same caveats as the preceding one, but the steps differ.

If your primary RabbitMQ controller node has gone down and you need to perform a disaster recovery, use this playbook from your Cloud Lifecycle Manager:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_that_needs_recovered>

If the controller node is not your primary, you can use this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml --limit <hostname_of_node_that_needs_recovered>

No reconfigure playbooks are needed because all of the fan-out exchanges are maintained by the running members of your RabbitMQ cluster.

18.3 Troubleshooting Compute service

Troubleshooting scenarios with resolutions for the nova service.

nova offers scalable, on-demand, self-service access to compute resources. You can use this guide to help with known issues and troubleshooting of nova services.

18.3.1 How can I reset the state of a compute instance?

If an instance is stuck in a non-Active state, such as Deleting or Rebooting, you can reset its state so that you can interact with the instance again.

The OpenStackClient (OSC) command openstack server set --state allows you to reset the state of a server.

Here is an abridged excerpt of the help information for the command, showing the relevant syntax:

$ openstack help server set
        usage: openstack server set [--state <state>] <server>

        Set server properties

        Positional arguments:
        <server>  Server (name or ID)

        Optional arguments:
        --state <state>  New server state (valid value: active, error)

If you had an instance that was stuck in a Rebooting state, you would use this command to reset it back to Active:

openstack server set --state active <instance_id>
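
If several instances are stuck, the same command can be applied in a loop. The wrapper below is a sketch: the function name reset_to_active is hypothetical, and it assumes that your OpenStack credentials are already loaded in the shell.

```shell
# reset_to_active: hypothetical wrapper around the OpenStackClient.
# Sets each given server back to the "active" state and reports any
# server for which the command failed.
# usage: reset_to_active <instance_id> [<instance_id> ...]
reset_to_active() {
    rc=0
    for server in "$@"; do
        if ! openstack server set --state active "$server"; then
            echo "could not reset $server" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

The function returns a non-zero status if any reset failed, so it can be used in scripts that need to stop on errors.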

18.3.2 Enabling the migrate or resize functions in nova post-installation when using encryption

If you used encryption for your data when running the configuration processor during your cloud deployment, and you are enabling the nova resize and migrate functionality after the initial installation, an issue arises if you have made additional configuration changes that required you to run the configuration processor before enabling these features.

You will only experience this issue if you have enabled encryption. If you have not enabled encryption, there is no need to follow the procedure below. If you are using encryption, have made a configuration change and run the configuration processor after your initial install or upgrade, have run the ready-deployment.yml playbook, and want to enable migrate or resize in nova, the following steps will allow you to proceed. Note that the ansible vault key referred to below is the encryption key that you provided to the configuration processor.

  1. Log in to the Cloud Lifecycle Manager.

  2. Check out the ansible branch of your local git:

    cd ~/openstack
    git checkout ansible
  3. Do a git log, and pick the previous commit:

    git log

    In this example below, the commit is ac54d619b4fd84b497c7797ec61d989b64b9edb3:

    $ git log

    commit 69f95002f9bad0b17f48687e4d97b2a791476c6a
    Merge: 439a85e ac54d61
    Author: git user <user@company.com>
    Date:   Fri May 6 09:08:55 2016 +0000

        Merging promotion of saved output

    commit 439a85e209aeeca3ab54d1a9184efb01604dbbbb
    Author: git user <user@company.com>
    Date:   Fri May 6 09:08:24 2016 +0000

        Saved output from CP run on 1d3976dac4fd7e2e78afad8d23f7b64f9d138778

    commit ac54d619b4fd84b497c7797ec61d989b64b9edb3
    Merge: a794083 66ffe07
    Author: git user <user@company.com>
    Date:   Fri May 6 08:32:04 2016 +0000

        Merging promotion of saved output
  4. Check out the commit:

    git checkout <commit_ID>

    Using the same example above, here is the command:

    $ git checkout ac54d619b4fd84b497c7797ec61d989b64b9edb3
    Note: checking out 'ac54d619b4fd84b497c7797ec61d989b64b9edb3'.

    You are in 'detached HEAD' state. You can look around, make experimental
    changes and commit them, and you can discard any commits you make in this
    state without impacting any branches by performing another checkout.

    If you want to create a new branch to retain commits you create, you may
    do so (now or later) by using -b with the checkout command again. Example:

      git checkout -b new_branch_name

    HEAD is now at ac54d61... Merging promotion of saved output
  5. Change to the ansible output directory:

    cd ~/openstack/my_cloud/stage/ansible/group_vars/
  6. View the group_vars file from the ansible vault - it will be of the form below, with your compute cluster name being the indicator:

    <cloud name>-<control plane name>-<compute cluster name>

    View this group_vars file from the ansible vault with this command which will prompt you for your vault password:

    ansible-vault view <group_vars_file>
  7. Search the contents of this file for the nova_ssh_key section which will contain both the private and public SSH keys which you should then save into a temporary file so you can use it in a later step.

    Here is an example snippet. The nova_ssh_key block, including both the private and public values, is the part you need to save:

    NOV_KVM:
                    vars:
                                  nova_ssh_key:
                      private: '-----BEGIN RSA PRIVATE KEY-----
                      MIIEpAIBAAKCAQEAv/hhekzykD2K8HnVNBKZcJWYrVlUyb6gR8cvE6hbh2ISzooA
                      jQc3xgglIwpt5TuwpTY3LL0C4PEHObxy9WwqXTHBZp8jg/02RzD02bEcZ1WT49x7
                      Rj8f5+S1zutHlDv7PwEIMZPAHA8lihfGFG5o+QHUmsUHgjShkWPdHXw1+6mCO9V/
                      eJVZb3nDbiunMOBvyyk364w+fSzes4UDkmCq8joDa5KkpTgQK6xfw5auEosyrh8D
                      zocN/JSdr6xStlT6yY8naWziXr7p/QhG44RPD9SSD7dhkyJh+bdCfoFVGdjmF8yA
                      h5DlcLu9QhbJ/scb7yMP84W4L5GwvuWCCFJTHQIDAQABAoIBAQCCH5O7ecMFoKG4
                      JW0uMdlOJijqf93oLk2oucwgUANSvlivJX4AGj9k/YpmuSAKvS4cnqZBrhDwdpCG
                      Q0XNM7d3mk1VCVPimNWc5gNiOBpftPNdBcuNryYqYq4WBwdq5EmGyGVMbbFPk7jH
                      ZRwAJ2MCPoplKl7PlGtcCMwNu29AGNaxCQEZFmztXcEFdMrfpTh3kuBI536pBlEi
                      Srh23mRILn0nvLXMAHwo94S6bI3JOQSK1DBCwtA52r5YgX0nkZbi2MvHISY1TXBw
                      SiWgzqW8dakzVu9UNif9nTDyaJDpU0kr0/LWtBQNdcpXnDSkHGjjnIm2pJVBC+QJ
                      SM9o8h1lAoGBANjGHtG762+dNPEUUkSNWVwd7tvzW9CZY35iMR0Rlux4PO+OXwNq
                      agldHeUpgG1MPl1ya+rkf0GD62Uf4LHTDgaEkUfiXkYtcJwHbjOnj3EjZLXaYMX2
                      LYBE0bMKUkQCBdYtCvZmo6+dfC2DBEWPEhvWi7zf7o0CJ9260aS4UHJzAoGBAOK1
                      P//K7HBWXvKpY1yV2KSCEBEoiM9NA9+RYcLkNtIy/4rIk9ShLdCJQVWWgDfDTfso
                      sJKc5S0OtOsRcomvv3OIQD1PvZVfZJLKpgKkt20/w7RwfJkYC/jSjQpzgDpZdKRU
                      vRY8P5iryptleyImeqV+Vhf+1kcH8t5VQMUU2XAvAoGATpfeOqqIXMpBlJqKjUI2
                      QNi1bleYVVQXp43QQrrK3mdlqHEU77cYRNbW7OwUHQyEm/rNN7eqj8VVhi99lttv
                      fVt5FPf0uDrnVhq3kNDSh/GOJQTNC1kK/DN3WBOI6hFVrmZcUCO8ewJ9MD8NQG7z
                      4NXzigIiiktayuBd+/u7ZxMCgYEAm6X7KaBlkn8KMypuyIsssU2GwHEG9OSYay9C
                      Ym8S4GAZKGyrakm6zbjefWeV4jMZ3/1AtXg4tCWrutRAwh1CoYyDJlUQAXT79Phi
                      39+8+6nSsJimQunKlmvgX7OK7wSp24U+SPzWYPhZYzVaQ8kNXYAOlezlquDfMxxv
                      GqBE5QsCgYA8K2p/z2kGXCNjdMrEM02reeE2J1Ft8DS/iiXjg35PX7WVIZ31KCBk
                      wgYTWq0Fwo2W/EoJVl2o74qQTHK0Bs+FTnR2nkVF3htEOAW2YXQTTN2rEsHmlQqE
                      A9iGTNwm9hvzbvrWeXtx8Zk/6aYfsXCoxq193KglS40shOCaXzWX0w==
                      -----END RSA PRIVATE KEY-----'
                      public: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/+GF6TPKQPYrwedU0Epl
                      wlZitWVTJvqBHxy8TqFuHYhLOigCNBzfGCCUjCm3lO7ClNjcsvQLg8Qc5vHL1bCpdMc
                      FmnyOD/TZHMPTZsRxnVZPj3HtGPx/n5LXO60eUO/s/AQgxk8AcDyWKF8YUbmj5Ad
                      SaxQeCNKGRY90dfDX7qYI71X94lVlvecNuK6cw4G/LKTfrjD59LN6zhQOSYKryOgNrkq
                      SlOBArrF/Dlq4SizKuHwPOhw38lJ2vrFK2VPrJjydpbOJevun9CEbjhE8P1JIPt2GTImH5t0
                      J+gVUZ2OYXzICHkOVwu71CFsn+xxvvIw/zhbgvkbC+5YIIUlMd
                      Generated Key for nova User
                    NTP_CLI:
  8. Switch back to the site branch by checking it out:

    cd ~/openstack
    git checkout site
  9. Navigate to your group_vars directory in this branch:

    cd ~/scratch/ansible/next/ardana/ansible/group_vars
  10. Edit your compute group_vars file, which will prompt you for your vault password:

    ansible-vault edit <group_vars_file>
    Vault password:
    Decryption successful
  11. Search the contents of this file for the nova_ssh_key section and replace the private and public keys with the contents that you had saved in a temporary file in step #7 earlier.

  12. Remove the temporary file that you created earlier. You are now ready to run the deployment. For information about enabling nova resizing and migration, see Section 6.4, “Enabling the Nova Resize and Migrate Features”.
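
The search in step 7 can be sketched as a filter over the decrypted text. The helper name extract_nova_ssh_key is hypothetical. It assumes that the nova_ssh_key block is indented more deeply than the keys that follow it, as in well-formed YAML, and the output should go to a file that only you can read.

```shell
# extract_nova_ssh_key: hypothetical helper. Reads decrypted group_vars
# text on stdin and prints the nova_ssh_key line plus every following
# line that is indented more deeply than it.
extract_nova_ssh_key() {
    awk '
        function indent(s) { match(s, /^ */); return RLENGTH }
        !found && /^ *nova_ssh_key:/ { found = 1; base = indent($0); print; next }
        found {
            # Stop at the first non-blank line at the same or a shallower level.
            if ($0 ~ /[^ ]/ && indent($0) <= base) exit
            print
        }
    '
}
```

One possible use is to run umask 077 first and then ansible-vault view <group_vars_file> | extract_nova_ssh_key > <temporary_file>, removing the temporary file in step 12 as described above.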

18.3.3 Compute (ESX)

Unable to Create Instance Snapshot when Instance is Active

There is a known issue with VMware vCenter: if you have a compute instance in Active state, you will receive the error below when attempting to take a snapshot of it:

An error occurred while saving the snapshot: Failed to quiesce the virtual machine

The workaround for this issue is to stop the instance. Here are steps to achieve this using the command line tool:

  1. Stop the instance using the OpenStackClient:

    openstack server stop <instance UUID>
  2. Take the snapshot of the instance.

  3. Start the instance back up:

    openstack server start <instance UUID>
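
The three steps above can be sketched as one function. The name snapshot_stopped, the SHUTOFF polling loop, and the five-minute limit are assumptions for illustration. The snapshot itself is taken with openstack server image create, which is one way to perform step 2.

```shell
# snapshot_stopped: hypothetical helper for the workaround above.
# usage: snapshot_stopped <instance UUID> <snapshot name>
snapshot_stopped() {
    server=$1 name=$2
    openstack server stop "$server" || return 1
    # Poll until the instance is reported as SHUTOFF (about 5 minutes at most).
    tries=0
    until [ "$(openstack server show -f value -c status "$server")" = "SHUTOFF" ]; do
        tries=$((tries + 1))
        if [ "$tries" -ge 60 ]; then
            echo "timed out waiting for $server to stop" >&2
            return 1
        fi
        sleep 5
    done
    # --wait blocks until the image has been created.
    openstack server image create --wait --name "$name" "$server" || return 1
    openstack server start "$server"
}
```

The instance is started again only after the snapshot command has returned successfully, so a failed snapshot leaves the instance stopped for inspection.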

18.3.4 How to archive deleted instances from the database

The nova-reconfigure.yml playbook can take a long time to run if the database has a large number of deleted instances.

To find the number of rows being used by deleted instances:

sudo mysql nova -e "select count(*) from instances where vm_state='deleted';"

To archive a batch of 1000 deleted instances to shadow tables:

sudo /opt/stack/service/nova-api/venv/bin/nova-manage \
    --config-dir /opt/stack/service/nova-api/etc/nova/ \
    db archive_deleted_rows --verbose --max_rows 1000

18.4 Network Service Troubleshooting

Troubleshooting scenarios with resolutions for the Networking service.

18.4.1 CVR HA - Split-brain result of failover of L3 agent when master comes back up

This situation is specific to deployments where L3 HA is configured and a network failure affects the node hosting the currently active L3 agent. L3 HA is intended to provide high availability when the L3 agent crashes or the node hosting an L3 agent crashes or restarts. If a physical networking issue isolates the active L3 agent, the standby L3 agent takes over. When the physical networking issue is resolved, however, traffic to the VMs is disrupted by a "split-brain" situation in which traffic is split over the two L3 agents. The solution is to restart the L3 agent that was originally the master.

18.4.2 OVSvApp Loses Connectivity with vCenter

If the OVSvApp loses connectivity with the vCenter cluster, you will see the following symptoms:

  1. The OVSvApp VM goes into ERROR state.

  2. The OVSvApp VM does not get an IP address.

When you see these symptoms:

  1. Restart the OVSvApp agent on the OVSvApp VM.

  2. Execute the following command to restart the Network (neutron) service:

    sudo service neutron-ovsvapp-agent restart

18.4.3 Fail over a plain CVR router because the node became unavailable

  1. Get a list of L3 agent UUIDs, which can be used in the commands that follow:

     openstack network agent list | grep l3
  2. Determine the current host:

     openstack network agent list --router <router uuid>
  3. Remove the router from the current host:

    openstack network agent remove router --l3 <current l3 agent uuid> <router uuid>
  4. Add the router to a new host:

    openstack network agent add router --l3 <new l3 agent uuid> <router uuid>
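
Steps 3 and 4 can be combined in a small function so that the router is only added to the new agent if the removal succeeded. This is a sketch: the name move_router and its argument order are hypothetical, and it uses the OpenStackClient --l3 option to select the L3 agent type.

```shell
# move_router: hypothetical helper.
# usage: move_router <router uuid> <current l3 agent uuid> <new l3 agent uuid>
move_router() {
    router=$1 old_agent=$2 new_agent=$3
    # Stop here if the router cannot be removed from its current agent.
    openstack network agent remove router --l3 "$old_agent" "$router" || return 1
    openstack network agent add router --l3 "$new_agent" "$router"
}
```

Afterwards, the command from step 2 can be used again to confirm which agent now hosts the router.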

18.4.4 Trouble setting maximum transmission units (MTU)

See Section 10.4.11, “Configuring Maximum Transmission Units in neutron” for more information.

18.4.5 Floating IP on allowed_address_pair port with DVR-routed networks

You may notice this issue: If you have an allowed_address_pair associated with multiple virtual machine (VM) ports, and if all the VM ports are ACTIVE, then the allowed_address_pair port binding will have the last ACTIVE VM's binding host as its bound host.

In addition, you may notice that if the floating IP is assigned to the allowed_address_pair that is bound to multiple VMs that are ACTIVE, then the floating IP will not work with DVR routers. This is different from the centralized router behavior where it can handle unbound allowed_address_pair ports that are associated with floating IPs.

Currently we support allowed_address_pair ports with DVR only if they have floating IPs enabled, and have just one ACTIVE port.

Using the CLI, you can follow these steps:

  1. Create a network to add the host to:

    $ openstack network create vrrp-net
  2. Attach a subnet to that network with a specified allocation-pool range:

    $ openstack subnet create --network vrrp-net --subnet-range 10.0.0.0/24 --allocation-pool start=10.0.0.2,end=10.0.0.200 vrrp-subnet
  3. Create a router, uplink the vrrp-subnet to it, and attach the router to an upstream network called public:

    $ openstack router create router1
    $ openstack router add subnet router1 vrrp-subnet
    $ openstack router set --external-gateway public router1

    Create a security group called vrrp-sec-group and add ingress rules to allow ICMP and TCP ports 80 and 22:

    $ openstack security group create vrrp-sec-group
    $ openstack security group rule create --protocol icmp vrrp-sec-group
    $ openstack security group rule create --protocol tcp --dst-port 80 vrrp-sec-group
    $ openstack security group rule create --protocol tcp --dst-port 22 vrrp-sec-group
  4. Next, boot two instances:

    $ openstack server create --min 2 --max 2 --image ubuntu-12.04 --flavor 1 --nic net-id=24e92ee1-8ae4-4c23-90af-accb3919f4d1 --security-group vrrp-sec-group vrrp-node
  5. When you create the two instances, make sure that they are not both in ACTIVE state before you associate the allowed_address_pair. List the instances:

    $ openstack server list
    +--------------------------------------+-------------------------------------------------+--------+------------+-------------+--------------------------------------------------------+
    | ID                                   | Name                                            | Status | Task State | Power State | Networks                                               |
    +--------------------------------------+-------------------------------------------------+--------+------------+-------------+--------------------------------------------------------+
    | 15b70af7-2628-4906-a877-39753082f84f | vrrp-node-15b70af7-2628-4906-a877-39753082f84f  | ACTIVE | -          | Running     | vrrp-net=10.0.0.3                                      |
    | e683e9d1-7eea-48dd-9d3a-a54cf9d9b7d6 | vrrp-node-e683e9d1-7eea-48dd-9d3a-a54cf9d9b7d6  | DOWN   | -          | Running     | vrrp-net=10.0.0.4                                      |
    +--------------------------------------+-------------------------------------------------+--------+------------+-------------+--------------------------------------------------------+
  6. Create a port in the VRRP IP range that was left out of the ip-allocation range:

    $ openstack port create --fixed-ip ip_address=10.0.0.201 --security-group vrrp-sec-group vrrp-net
    Created a new port:
    +-----------------------+-----------------------------------------------------------------------------------+
    | Field                 | Value                                                                             |
    +-----------------------+-----------------------------------------------------------------------------------+
    | admin_state_up        | True                                                                              |
    | allowed_address_pairs |                                                                                   |
    | device_id             |                                                                                   |
    | device_owner          |                                                                                   |
    | fixed_ips             | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.201"} |
    | id                    | 6239f501-e902-4b02-8d5c-69062896a2dd                                              |
    | mac_address           | fa:16:3e:20:67:9f                                                                 |
    | name                  |                                                                                   |
    | network_id            | 24e92ee1-8ae4-4c23-90af-accb3919f4d1                                              |
    | port_security_enabled | True                                                                              |
    | security_groups       | 36c8131f-d504-4bcc-b708-f330c9f6b67a                                              |
    | status                | DOWN                                                                              |
    | tenant_id             | d4e4332d5f8c4a8eab9fcb1345406cb0                                                  |
    +-----------------------+-----------------------------------------------------------------------------------+
  7. After you associate the allowed_address_pair port with the VM port, also check whether the allowed_address_pair port has inherited the VM's host binding:

    $ neutron --os-username admin --os-password ZIy9xitH55 --os-tenant-name admin port-show f5a252b2-701f-40e9-a314-59ef9b5ed7de
    +-----------------------+--------------------------------------------------------------------------------------------------------+
    | Field                 | Value                                                                                                  |
    +-----------------------+--------------------------------------------------------------------------------------------------------+
    | admin_state_up        | True                                                                                                   |
    | allowed_address_pairs |                                                                                                        |
    | binding:host_id       | ...-cp1-comp0001-mgmt                                                                                  |
    | binding:profile       | {}                                                                                                     |
    | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                         |
    | binding:vif_type      | ovs                                                                                                    |
    | binding:vnic_type     | normal                                                                                                 |
    | device_id             |                                                                                                        |
    | device_owner          | compute:None                                                                                           |
    | dns_assignment        | {"hostname": "host-10-0-0-201", "ip_address": "10.0.0.201", "fqdn": "host-10-0-0-201.openstacklocal."} |
    | dns_name              |                                                                                                        |
    | extra_dhcp_opts       |                                                                                                        |
    | fixed_ips             | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.201"}                      |
    | id                    | 6239f501-e902-4b02-8d5c-69062896a2dd                                                                   |
    | mac_address           | fa:16:3e:20:67:9f                                                                                      |
    | name                  |                                                                                                        |
    | network_id            | 24e92ee1-8ae4-4c23-90af-accb3919f4d1                                                                   |
    | port_security_enabled | True                                                                                                   |
    | security_groups       | 36c8131f-d504-4bcc-b708-f330c9f6b67a                                                                   |
    | status                | DOWN                                                                                                   |
    | tenant_id             | d4e4332d5f8c4a8eab9fcb1345406cb0                                                                       |
    +-----------------------+--------------------------------------------------------------------------------------------------------+
  8. Note that you were allocated a port with the IP address 10.0.0.201 as requested. Next, associate a floating IP to this port to be able to access it publicly:

    $ openstack floating ip create --port 6239f501-e902-4b02-8d5c-69062896a2dd public
    Created a new floatingip:
    +---------------------+--------------------------------------+
    | Field               | Value                                |
    +---------------------+--------------------------------------+
    | fixed_ip_address    | 10.0.0.201                           |
    | floating_ip_address | 10.36.12.139                         |
    | floating_network_id | 3696c581-9474-4c57-aaa0-b6c70f2529b0 |
    | id                  | a26931de-bc94-4fd8-a8b9-c5d4031667e9 |
    | port_id             | 6239f501-e902-4b02-8d5c-69062896a2dd |
    | router_id           | 178fde65-e9e7-4d84-a218-b1cc7c7b09c7 |
    | tenant_id           | d4e4332d5f8c4a8eab9fcb1345406cb0     |
    +---------------------+--------------------------------------+
  9. Now update the ports attached to your VRRP instances to include this IP address as an allowed-address-pair so they will be able to send traffic out using this address. First find the ports attached to these instances:

    $ openstack port list --network 24e92ee1-8ae4-4c23-90af-accb3919f4d1
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
    | id                                   | name | mac_address       | fixed_ips                                                                         |
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
    | 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d |      | fa:16:3e:7a:7b:18 | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.4"}   |
    | 14f57a85-35af-4edb-8bec-6f81beb9db88 |      | fa:16:3e:2f:7e:ee | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.2"}   |
    | 6239f501-e902-4b02-8d5c-69062896a2dd |      | fa:16:3e:20:67:9f | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.201"} |
    | 87094048-3832-472e-a100-7f9b45829da5 |      | fa:16:3e:b3:38:30 | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.1"}   |
    | c080dbeb-491e-46e2-ab7e-192e7627d050 |      | fa:16:3e:88:2e:e2 | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.3"}   |
    +--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
  10. Add this address to the ports c080dbeb-491e-46e2-ab7e-192e7627d050 and 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d which are 10.0.0.3 and 10.0.0.4 (your vrrp-node instances):

    $ openstack port set --allowed-address ip-address=10.0.0.201 c080dbeb-491e-46e2-ab7e-192e7627d050
    $ openstack port set --allowed-address ip-address=10.0.0.201 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d
  11. The allowed-address-pair 10.0.0.201 now shows up on the port:

    $ openstack port show 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d
    +-----------------------+---------------------------------------------------------------------------------+
    | Field                 | Value                                                                           |
    +-----------------------+---------------------------------------------------------------------------------+
    | admin_state_up        | True                                                                            |
    | allowed_address_pairs | {"ip_address": "10.0.0.201", "mac_address": "fa:16:3e:7a:7b:18"}                |
    | device_id             | e683e9d1-7eea-48dd-9d3a-a54cf9d9b7d6                                            |
    | device_owner          | compute:None                                                                    |
    | fixed_ips             | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.4"} |
    | id                    | 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d                                            |
    | mac_address           | fa:16:3e:7a:7b:18                                                               |
    | name                  |                                                                                 |
    | network_id            | 24e92ee1-8ae4-4c23-90af-accb3919f4d1                                            |
    | port_security_enabled | True                                                                            |
    | security_groups       | 36c8131f-d504-4bcc-b708-f330c9f6b67a                                            |
    | status                | ACTIVE                                                                          |
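    | tenant_id             | d4e4332d5f8c4a8eab9fcb1345406cb0                                                |
    +-----------------------+---------------------------------------------------------------------------------+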

18.4.6 OpenStack traffic that must traverse VXLAN tunnel dropped when using HPE 5930 switch

Cause: UDP destination port 4789 is conflicting with OpenStack VXLAN traffic.

The switch has a configuration setting for the UDP port number that the HPN kit uses for its own VXLAN tunnels. Setting it to a port number other than the one neutron uses by default (4789) keeps the HPN kit from intercepting neutron's VXLAN traffic. Specifically:

Parameters:

port-number: Specifies a UDP port number in the range of 1 to 65535. As a best practice, specify a port number in the range of 1024 to 65535 to avoid conflict with well-known ports.

Usage guidelines:

You must configure the same destination UDP port number on all VTEPs in a VXLAN.

Examples

# Set the destination UDP port number to 6666 for VXLAN packets.
<Sysname> system-view
[Sysname] vxlan udp-port 6666

Use vxlan udp-port to configure the destination UDP port number of VXLAN packets. A destination UDP port must be specified for all VXLAN packets. By default, the destination UDP port number for VXLAN packets is 4789.

OVS can be configured to use a different port number itself:

# (IntOpt) The port number to utilize if tunnel_types includes 'vxlan'. By
# default, this will make use of the Open vSwitch default value of '4789' if
# not specified.
#
# vxlan_udp_port =
# Example: vxlan_udp_port = 8472
#
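
Because every VTEP must agree on the port, it can help to confirm which port a given agent configuration actually selects. The following is a minimal sketch, not part of the product: the function name effective_vxlan_port is hypothetical, and it assumes you pipe in the text of the configuration file that carries the vxlan_udp_port option.

```shell
# Hypothetical helper: print the VXLAN UDP port selected by agent
# configuration text read from stdin. A commented-out or missing setting
# falls back to 4789, the default named in the option description above.
effective_vxlan_port() (
    port=$(sed -n 's/^[[:space:]]*vxlan_udp_port[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' | tail -n 1)
    echo "${port:-4789}"
)
```

If the printed value differs from the port configured on the switch with vxlan udp-port, the two sides are not using the same destination port.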

18.4.7 Issue: PCI-PT virtual machine gets stuck at boot

If a PCI-PT virtual machine on a host with Intel NICs gets stuck at boot, disable the boot agent.

When Intel cards are used for PCI-PT, the tenant virtual machine sometimes gets stuck at boot. If this happens, download Intel bootutils and use it to disable the boot agent.

Use the following steps:

  1. Download preboot.tar.gz from the Intel website.

  2. Untar the preboot.tar.gz file on the compute host where the PCI-PT virtual machine is to be hosted.

  3. Go to the path ~/APPS/BootUtil/Linux_x64 and then run the following command:

    ./bootutil64e -BOOTENABLE disable -all
  4. Now boot the PCI-PT virtual machine and it should boot without getting stuck.

18.5 Troubleshooting the Image (glance) Service

Troubleshooting scenarios with resolutions for the glance service. We have gathered some of the common issues and troubleshooting steps that will help when resolving issues that occur with the glance service.

18.5.1 Images Created in Horizon UI Get Stuck in a Queued State

When creating a new image in the horizon UI you will see the option for Image Location, which allows you to enter an HTTP source to use when creating a new image for your cloud. However, this option is disabled by default for security reasons. As a result, any new image created via this method gets stuck in a Queued state.

We cannot guarantee the security of any third-party sites you use as image sources, and the traffic goes over HTTP (non-SSL).

Resolution: You will need your cloud administrator to enable the HTTP store option in glance for your cloud.

Here are the steps to enable this option:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the file below:

    ~/openstack/ardana/ansible/roles/GLA-API/templates/glance-api.conf.j2
  3. Locate the glance store options and add the http value in the stores field. It will look like this:

    [glance_store]
    stores = {{ glance_stores }}

    Change this to:

    [glance_store]
    stores = {{ glance_stores }},http
  4. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "adding HTTP option to glance store list"
  5. Run the configuration processor with this command:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Use the playbook below to create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the glance service reconfigure playbook which will update these settings:

    cd ~/scratch/ansible/next/ardana/ansible/
    ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml

18.6 Storage Troubleshooting

Troubleshooting scenarios with resolutions for the block storage (cinder) and object storage (swift) services.

18.6.1 Block Storage Troubleshooting

The block storage service utilizes OpenStack cinder and can integrate with multiple back-ends including 3Par, SUSE Enterprise Storage, and Ceph. Failures may exist at the cinder API level, an operation may fail, or you may see an alarm trigger in the monitoring service. These may be caused by configuration problems, network issues, or issues with your servers or storage back-ends. This section describes how the service works, where to find additional information, some of the common problems that come up, and how to address them.

18.6.1.1 Where to find information

When debugging block storage issues it is helpful to understand the deployment topology and know where to locate the logs with additional information.

The cinder service consists of:

  • An API service, typically deployed and active on the controller nodes.

  • A scheduler service, also typically deployed and active on the controller nodes.

  • A volume service, which is deployed on all of the controller nodes but only active on one of them.

  • A backup service, which is deployed on the same controller node as the volume service.

You can refer to your configuration files (usually located in ~/openstack/my_cloud/definition/ on the Cloud Lifecycle Manager) for specifics about where your services are located. They will usually be located on the controller nodes.

cinder uses a MariaDB database and communicates between components by consuming messages from a RabbitMQ message service.

The cinder API service is layered underneath a HAProxy service and accessed using a virtual IP address maintained using keepalived.

If any of the cinder components is not running on its intended host then an alarm will be raised. Details on how to resolve these alarms can be found in Section 18.1.1, “Alarm Resolution Procedures”. You should check the logs for the service on the appropriate nodes. All cinder logs are stored in /var/log/cinder/ and all log entries above INFO level are also sent to the centralized logging service. For details on how to change the logging level of the cinder service, see Section 13.2.6, “Configuring Settings for Other Services”.

In order to get the full context of an error you may need to examine the full log files on individual nodes. Note that if a component runs on more than one node you will need to review the logs on each of the nodes that component runs on. Also remember that logs rotate, so the time interval you are interested in may be in an older log file.

Log locations:

/var/log/cinder/cinder-api.log - Check this log if you have endpoint or connectivity issues

/var/log/cinder/cinder-scheduler.log - Check this log if the system cannot assign your volume to a back-end

/var/log/cinder/cinder-backup.log - Check this log if you have backup or restore issues

/var/log/cinder/cinder-volume.log - Check here for failures during volume creation

/var/log/nova/nova-compute.log - Check here for failures with attaching volumes to compute instances

You can also check the logs for the database and/or the RabbitMQ service if your cloud exhibits database or messaging errors.

If the API servers are up and running but the API is not reachable, check the HAProxy logs on the active keepalived node.

If you have errors attaching volumes to compute instances using the nova API then the logs would be on the compute node associated with the instance. You can use the following command to determine which node is hosting the instance:

openstack server show <instance_uuid>

Then you can check the logs located at /var/log/nova/nova-compute.log on that compute node.
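
If you capture the table output rather than reading it by eye, the host can be picked out with a short filter. This is a sketch only: the function name is hypothetical, and it assumes the output contains the OS-EXT-SRV-ATTR:host row, which the openstack CLI normally shows only to admin users.

```shell
# Hypothetical helper: read `openstack server show <instance_uuid>` table
# output on stdin and print the compute host that runs the instance.
# Only the OS-EXT-SRV-ATTR:host row is matched, not hypervisor_hostname.
compute_host_from_server_show() {
    awk -F'|' '$2 ~ /^ *OS-EXT-SRV-ATTR:host *$/ { gsub(/ /, "", $3); print $3 }'
}

# Example (run on the Cloud Lifecycle Manager):
#   openstack server show <instance_uuid> | compute_host_from_server_show
```

The printed name is the node whose /var/log/nova/nova-compute.log you need to examine.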

18.6.1.2 Understanding the cinder volume states

Once you understand the topology, and if the issue with the cinder service relates to a specific volume, you should know the states that a volume can be in. The states are:

  • attaching

  • available

  • backing-up

  • creating

  • deleting

  • downloading

  • error

  • error attaching

  • error deleting

  • error detaching

  • error extending

  • error restoring

  • in-use

  • extending

  • restoring

  • restoring backup

  • retyping

  • uploading

The most common states are in-use, which indicates that a volume is currently attached to a compute instance, and available, which means that the volume has been created on a back-end and is free to be attached to an instance. All -ing states are transient and represent a transition. If a volume stays in one of those states for too long, indicating that it is stuck, or if it fails and goes into an error state, check the logs for failures.
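
The grouping described above can be expressed as a small classifier. This is a minimal sketch with a hypothetical function name; it relies only on the naming pattern of the states listed in this section and does not query cinder.

```shell
# Hypothetical helper: classify a cinder volume status string as
# "error", "transient" (an -ing state), or "stable".
classify_volume_state() {
    case "$1" in
        error*)               echo "error" ;;      # error, error deleting, ...
        *ing|*ing-*|*ing\ *)  echo "transient" ;;  # creating, backing-up, restoring backup
        *)                    echo "stable" ;;     # available, in-use
    esac
}
```

A volume that is reported as transient is only a problem if it stays that way; the classifier says nothing about how long the volume has been in the state.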

18.6.1.3 Initial troubleshooting steps

These should be the initial troubleshooting steps you go through.

  1. If you have noticed an issue with the service, you should check your monitoring system for any alarms that may have triggered. See Section 18.1.1, “Alarm Resolution Procedures” for resolution steps for those alarms.

  2. Check if the cinder API service is active by listing the available volumes from the Cloud Lifecycle Manager:

    source ~/service.osrc
    openstack volume list

18.6.1.4 Common failures

Alerts from the cinder service

Check for alerts associated with the block storage service, noting that these could include alerts related to the server nodes being down, alerts related to the messaging and database services, or the HAProxy and keepalived services, as well as alerts directly attributed to the block storage service.

The Operations Console provides a web UI method for checking alarms.

cinder volume service is down

The cinder volume service could be down if the server hosting the volume service fails. (Running the command openstack volume service list will show the state of the volume service.) In this case, start the volume service on another controller node. See Section 8.1.3, “Managing cinder Volume and Backup Services” for details.

Creating a cinder bootable volume fails

When creating a bootable volume from an image, your cinder volume must be larger than the Virtual Size (raw size) of your image or creation will fail with an error.

When creating your disk model for nodes that will have the cinder volume role, make sure that there is sufficient disk space allocated for temporary space for image conversion if you will be creating bootable volumes. Allocate enough space to the filesystem as would be needed to contain the raw size of images to be used for bootable volumes. For example, Windows images can be quite large in raw format.

By default, cinder uses /var/lib/cinder for image conversion and this will be on the root filesystem unless it is explicitly separated. You can ensure there is enough space by ensuring that the root file system is sufficiently large, or by creating a logical volume mounted at /var/lib/cinder in the disk model when installing the system.

If your system is already installed, use these steps to update this:

  1. Edit the configuration item image_conversion_dir in cinder.conf.j2 to point to another location with more disk space. Make sure that the new directory location has the same ownership and permissions as /var/lib/cinder (owner: cinder, group: cinder, mode: 0750).

  2. Then run this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml
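
Before creating a bootable volume, you can compare the free space in the conversion directory with the raw size of the image. The following is a sketch under stated assumptions: the function name is hypothetical, the byte count is something you supply yourself (for example, the virtual size reported by qemu-img info for the image), and the check looks only at the free space reported by df at the moment it runs.

```shell
# Hypothetical helper: succeed if the filesystem that holds directory $1
# has at least $2 bytes available. `df -Pk` prints POSIX-format output in
# 1024-byte blocks; the 4th column of the 2nd line is the available space.
has_free_bytes() (
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ $((avail_kb * 1024)) -ge "$2" ]
)

# Example: check for 40 GiB of free space in the default conversion directory.
#   has_free_bytes /var/lib/cinder $((40 * 1024 * 1024 * 1024)) && echo "enough space"
```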

API-level failures

If the API is inaccessible, determine if the API service has halted on the controller nodes. If a single instance of cinder-api goes down but other instances remain online on other controllers, load balancing would typically automatically direct all traffic to the online nodes. The cinder-status.yml playbook can be used to report on the health of the API service from the Cloud Lifecycle Manager:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts cinder-status.yml

Service failures can be diagnosed by reviewing the logs in centralized logging or on the individual controller nodes.

Note

After a controller node is rebooted, you must make sure to run the ardana-start.yml playbook to ensure all the services are up and running. For more information, see Section 15.2.3.1, “Restarting Controller Nodes After a Reboot”.

If the API service is returning an error code, look for the error message in the API logs on all API nodes. Successful completions would be logged like this:

2016-04-25 10:09:51.107 30743 INFO eventlet.wsgi.server [req-a14cd6f3-6c7c-4076-adc3-48f8c91448f6
dfb484eb00f94fb39b5d8f5a894cd163 7b61149483ba4eeb8a05efa92ef5b197 - - -] 192.168.186.105 - - [25/Apr/2016
10:09:51] "GET /v2/7b61149483ba4eeb8a05efa92ef5b197/volumes/detail HTTP/1.1" 200 13915 0.235921

where 200 represents HTTP status 200 for a successful completion. Look for a line with your status code and then examine all entries associated with the request ID. In the example above, the request ID is req-a14cd6f3-6c7c-4076-adc3-48f8c91448f6.

The request may have failed at the scheduler or at the volume or backup service and you should also check those logs at the time interval of interest, noting that the log file of interest may be on a different node.
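
To follow one request across the API, scheduler, volume, and backup logs, it helps to pull the request ID and status code out of an API log entry first. This is a minimal sketch with hypothetical function names; it assumes log entries shaped like the eventlet.wsgi.server example above, passed in as a single line.

```shell
# Hypothetical helper: print the first request ID (req-...) in a log line.
extract_request_id() {
    printf '%s\n' "$1" | grep -o 'req-[0-9a-f-]*' | head -n 1
}

# Hypothetical helper: print the HTTP status code, which is the first
# three-digit field after the quoted request line.
extract_http_status() {
    printf '%s\n' "$1" | sed -n 's/.*HTTP\/[0-9.]*" \([0-9][0-9][0-9]\) .*/\1/p'
}

# Example: list every entry for that request in the local cinder logs.
#   grep -h "$(extract_request_id "$line")" /var/log/cinder/*.log
```

Remember that the matching entries may be on a different node, so the grep may need to be repeated on each node that runs a cinder component.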

Operations that do not complete

If you have started an operation, such as creating or deleting a volume, that does not complete, the cinder volume may be stuck in a transitory state. You should follow the procedures for dealing with stuck volumes.

There are six transitory states that a volume can get stuck in:

State        Description
creating     The cinder volume manager has sent a request to a back-end driver to create a volume, but has not received confirmation that the volume is available.
attaching    cinder has received a request from nova to make a volume available for attaching to an instance, but has not received confirmation from nova that the attachment is complete.
detaching    cinder has received notification from nova that it will detach a volume from an instance, but has not received notification that the detachment is complete.
deleting     cinder has received a request to delete a volume, but has not completed the operation.
backing-up   The cinder backup manager has started to back a volume up to swift, or some other backup target, but has not completed the operation.
restoring    The cinder backup manager has started to restore a volume from swift, or some other backup target, but has not completed the operation.

At a high level, the steps that you would take to address any of these states are similar:

  1. Confirm that the volume is actually stuck, and not just temporarily blocked.

  2. Where possible, remove any resources being held by the volume. For example, if a volume is stuck detaching it may be necessary to remove associated iSCSI or DM devices on the compute node.

  3. Reset the state of the volume to an appropriate state, for example to available or error.

  4. Do any final cleanup. For example, if you reset the state to error you can then delete the volume.

The next sections will describe specific steps you can take for volumes stuck in each of the transitory states.

Volumes stuck in Creating

Broadly speaking, there are two possible scenarios where a volume would get stuck in creating. The cinder-volume service could have thrown an exception while it was attempting to create the volume, and failed to handle the exception correctly. Or the volume back-end could have failed, or gone offline, after it received the request from cinder to create the volume.

These two cases are different in that for the second case you will need to determine the reason the back-end is offline and restart it. Often, when the back-end has been restarted, the volume will move from creating to available so your issue will be resolved.

If you can create volumes successfully on the same back-end as the volume stuck in creating then the back-end is not down. So you will need to reset the state for the volume and then delete it.

To reset the state of a volume you can use the openstack volume set --state command. You can use either the UUID or the volume name of the stuck volume.

For example, here is a volume list where we have a stuck volume:

$ openstack volume list
+--------------------------------------+-----------+------+------+-------------+------------+
|                  ID                  |   Status  | Name | Size | Volume Type |Attached to |
+--------------------------------------+-----------+------+------+-------------+------------+
| 14b76133-e076-4bd3-b335-fa67e09e51f6 | creating  | vol1 |  1   |      -      |            |
+--------------------------------------+-----------+------+------+-------------+------------+

You can reset the state by using the openstack volume set --state command, like this:

openstack volume set --state error 14b76133-e076-4bd3-b335-fa67e09e51f6

Confirm that with another listing:

$ openstack volume list
+--------------------------------------+-----------+------+------+-------------+------------+
|                  ID                  |   Status  | Name | Size | Volume Type |Attached to |
+--------------------------------------+-----------+------+------+-------------+------------+
| 14b76133-e076-4bd3-b335-fa67e09e51f6 | error     | vol1 |  1   |      -      |            |
+--------------------------------------+-----------+------+------+-------------+------------+

You can then delete the volume:

$ openstack volume delete 14b76133-e076-4bd3-b335-fa67e09e51f6
Request to delete volume 14b76133-e076-4bd3-b335-fa67e09e51f6 has been accepted.
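
When several volumes are affected, you can list the candidates before resetting anything. The sketch below uses a hypothetical function name and assumes input in the two-column form produced by the openstack CLI's generic -f value -c ID -c Status output options. It only prints IDs and does not change any volume.

```shell
# Hypothetical helper: read "ID Status" pairs on stdin and print the IDs
# whose status equals $1.
volumes_in_state() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Example (run on the Cloud Lifecycle Manager after sourcing ~/service.osrc):
#   openstack volume list -f value -c ID -c Status | volumes_in_state creating
```

Confirm that each listed volume is really stuck, and not just slow, before you reset its state.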

Volumes stuck in Deleting

If a volume is stuck in the deleting state then the request to delete the volume may or may not have been sent to and actioned by the back-end. If you can identify volumes on the back-end then you can examine the back-end to determine whether the volume is still there or not. Then you can decide which of the following paths you can take. It may also be useful to determine whether the back-end is responding, either by checking for recent volume create attempts, or creating and deleting a test volume.

The first option is to reset the state of the volume to available and then attempt to delete the volume again.

The second option is to reset the state of the volume to error and then delete the volume.

If you have reset the volume state to error then the volume may still be consuming storage on the back-end. If that is the case then you will need to delete it from the back-end using your back-end's specific tool.

Volumes stuck in Attaching

The most complicated situation to deal with is where a volume is stuck either in attaching or detaching, because as well as dealing with the state of the volume in cinder and the back-end, you have to deal with exports from the back-end, imports to the compute node, and attachments to the compute instance.

You have two options here. Either make sure that all exports and imports are deleted and reset the state of the volume to available, or make sure that all of the exports and imports are correct and reset the state of the volume to in-use.

A volume that is in attaching state should never have been made available to a compute instance and therefore should not have any data written to it, or in any buffers between the compute instance and the volume back-end. In that situation, it is often safe to manually tear down the devices exported on the back-end and imported on the compute host and then reset the volume state to available.

You can use the management features of the back-end you are using to locate the compute host to which the volume is being exported.

Volumes stuck in Detaching

The steps in dealing with a volume stuck in detaching state are very similar to those for a volume stuck in attaching. However, there is the added consideration that the volume was attached to, and probably servicing, I/O from a compute instance. So you must take care to ensure that all buffers are properly flushed before detaching the volume.

When a volume is stuck in detaching, the output from an openstack volume list command will include the UUID for the instance to which the volume was attached. From that you can identify the compute host that is running the instance using the openstack server show command.

For example, here are some snippets:

$ openstack volume list
+--------------------------------------+-----------+-----------------------+-----------------+
|                  ID                  |   Status  |       Name            |   Attached to   |
+--------------------------------------+-----------+-----------------------+-----------------+
| 85384325-5505-419a-81bb-546c69064ec2 | detaching |        vol1           | 4bedaa76-78ca-… |
+--------------------------------------+-----------+-----------------------+-----------------+
$ openstack server show 4bedaa76-78ca-4fe3-806a-3ba57a9af361|grep host
| OS-EXT-SRV-ATTR:host                 | mycloud-cp1-comp0005-mgmt
| OS-EXT-SRV-ATTR:hypervisor_hostname  | mycloud-cp1-comp0005-mgmt
| hostId                               | 61369a349bd6e17611a47adba60da317bd575be9a900ea590c1be816

The first thing to check in this case is whether the instance is still importing the volume. Use virsh list and virsh dumpxml ID to see the underlying condition of the virtual machine. If the XML for the instance has a reference to the device, then you should reset the volume state to in-use and attempt the cinder detach operation again.

$ openstack volume set --state in-use --attached 85384325-5505-419a-81bb-546c69064ec2

If the volume gets stuck detaching again, there may be a more fundamental problem, which is outside the scope of this document and you should contact the Support team.

If the volume is not referenced in the XML for the instance then you should remove any devices on the compute node and back-end and then reset the state of the volume to available.

$ openstack volume set --state available --detached 85384325-5505-419a-81bb-546c69064ec2

You can use the management features of the back-end you are using to locate the compute host to which the volume is being exported.
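
The decision between the two target states depends on whether the instance XML still references the volume. The sketch below encodes that decision. The function name is hypothetical, and it assumes the volume UUID appears verbatim somewhere in the virsh dumpxml output (for example, in the disk's serial element). Verify this on your own system before relying on it.

```shell
# Hypothetical helper: read libvirt domain XML on stdin and print the state
# this section suggests resetting the volume with UUID $1 to:
#   in-use     the XML still references the volume
#   available  the XML has no reference to the volume
suggest_reset_state() {
    if grep -qF -- "$1"; then
        echo "in-use"
    else
        echo "available"
    fi
}

# Example (run on the compute host, ID taken from `virsh list`):
#   virsh dumpxml ID | suggest_reset_state 85384325-5505-419a-81bb-546c69064ec2
```

A result of available still requires you to remove any leftover devices on the compute node and back-end before resetting the state.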

Volumes stuck in restoring

Restoring a cinder volume from backup will be as slow as backing it up. So you must confirm that the volume is actually stuck by examining the cinder-backup.log. For example:

# tail -f cinder-backup.log |grep 162de6d5-ba92-4e36-aba4-e37cac41081b
2016-04-27 12:39:14.612 6689 DEBUG swiftclient [req-0c65ec42-8f9d-430a-b0d5-05446bf17e34 - -
2016-04-27 12:39:15.533 6689 DEBUG cinder.backup.chunkeddriver [req-0c65ec42-8f9d-430a-b0d5-
2016-04-27 12:39:15.566 6689 DEBUG requests.packages.urllib3.connectionpool [req-0c65ec42-
2016-04-27 12:39:15.567 6689 DEBUG swiftclient [req-0c65ec42-8f9d-430a-b0d5-05446bf17e34 - - -

If you determine that the volume is genuinely stuck in the restoring state then you must follow the procedure described in the detaching section above to remove any volumes that remain exported from the back-end and imported on the controller node. Remember that in this case the volumes will be imported and mounted on the controller node running cinder-backup. So you do not have to search for the correct compute host. Also remember that no instances are involved so you do not need to confirm that the volume is not imported to any instances.

18.6.1.5 Debugging volume attachment

In an error case, a cinder volume can fail to complete an operation and revert to its initial state. One example is attaching a cinder volume to a nova instance. In that case, follow the steps above to examine the nova compute logs for the attach request.

18.6.1.6 Errors creating volumes

If you are creating a volume and it goes into the ERROR state, a common error to see is No valid host was found. This means that the scheduler could not schedule your volume to a back-end. You should check that the volume service is up and running. You can use this command:

$ sudo cinder-manage service list
Binary           Host                                 Zone             Status     State Updated At
cinder-scheduler ha-volume-manager                    nova             enabled    :-)   2016-04-25 11:39:30
cinder-volume    ha-volume-manager@ses1               nova             enabled    XXX   2016-04-25 11:27:26
cinder-backup    ha-volume-manager                    nova             enabled    :-)   2016-04-25 11:39:28

In this example, the state of XXX indicates that the service is down.
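
On a deployment with many back-ends, the down services can be filtered out of the listing. This is a minimal sketch with a hypothetical function name; it assumes the column layout shown in the example above, where State is the fifth whitespace-separated field.

```shell
# Hypothetical helper: read `cinder-manage service list` output on stdin
# and print "binary host" for every service whose State column is XXX.
down_cinder_services() {
    awk '$5 == "XXX" { print $1, $2 }'
}

# Example:
#   sudo cinder-manage service list | down_cinder_services
```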

If the service is up, next check that the back-end has sufficient space. You can use this command to show the available and total space on each back-end:

openstack volume backend pool list --long

If your deployment is using volume types, verify that the volume_backend_name in your cinder.conf file matches the volume_backend_name for the volume type you selected.

You can verify the back-end name on your volume type by using this command:

openstack volume type list

Then list the details about your volume type. For example:

$ openstack volume type show dfa8ecbd-8b95-49eb-bde7-6520aebacde0
+---------------------------------+--------------------------------------+
| Field                           | Value                                |
+---------------------------------+--------------------------------------+
| description                     | None                                 |
| id                              | dfa8ecbd-8b95-49eb-bde7-6520aebacde0 |
| is_public                       | True                                 |
| name                            | my3par                               |
| os-volume-type-access:is_public | True                                 |
| properties                      | volume_backend_name='3par'           |
+---------------------------------+--------------------------------------+
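
The comparison between the volume type and cinder.conf can be scripted. The following is a sketch with hypothetical function names. It assumes the properties value has the volume_backend_name='...' form shown in the table above and that cinder.conf uses plain key = value lines; other client versions may print the properties differently.

```shell
# Hypothetical helper: extract the back-end name from a properties string
# such as: volume_backend_name='3par'
backend_name_from_properties() {
    printf '%s\n' "$1" | sed -n "s/.*volume_backend_name='\([^']*\)'.*/\1/p"
}

# Hypothetical helper: read cinder.conf text on stdin and print every
# uncommented volume_backend_name value, one per line.
backend_names_from_conf() {
    sed -n 's/^[[:space:]]*volume_backend_name[[:space:]]*=[[:space:]]*//p'
}

# Example: succeed if the type's back-end name is configured in cinder.conf.
#   backend_names_from_conf < cinder.conf | grep -qx "$(backend_name_from_properties "$props")"
```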

18.6.2 swift Storage Troubleshooting

Troubleshooting scenarios with resolutions for the swift service. You can use these guides to help you identify and resolve basic problems you may experience while deploying or using the Object Storage service. It contains the following troubleshooting scenarios:

18.6.2.1 Deployment Fails With MSDOS Disks Labels Do Not Support Partition Names

Description

If a disk drive allocated to swift uses the MBR partition table type, the deploy process refuses to label and format the drive. This is to prevent potential data loss. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.5 “Allocating Disk Drives for Object Storage”.) If you intend to use the disk drive for swift, you must convert the MBR partition table to GPT on the drive using /sbin/sgdisk.

Note

This process only applies to swift drives. It does not apply to the operating system or boot drive.

Resolution

You must install gdisk before using sgdisk:

  1. Run the following command to install gdisk:

    sudo zypper install gdisk
  2. Convert to the GPT partition type. Following is an example for converting /dev/sdd to the GPT partition type:

    sudo sgdisk -g /dev/sdd
  3. Reboot the node for the change to take effect. You may then resume the deployment (repeat the playbook that reported the error).
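
Before converting, you may want to confirm that the drive really carries an MBR (msdos) label. The sketch below is one way to do that. It assumes the parted utility is installed and that its print output contains a "Partition Table:" line; the function name is hypothetical.

```shell
# Hypothetical helper: read `parted -s <device> print` output on stdin and
# succeed only if the device has an msdos (MBR) partition table.
needs_gpt_conversion() {
    grep -q '^Partition Table: msdos'
}

# Example for /dev/sdd:
#   sudo parted -s /dev/sdd print | needs_gpt_conversion && echo "/dev/sdd uses MBR"
```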

18.6.2.2 Examining Planned Ring Changes

Before making major changes to your rings, you can see the planned layout of swift rings using the following steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the swift-compare-model-rings.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
  3. Validate the following in the output:

    • Drives are being added to all rings in the ring specifications.

    • Servers are being used as expected (for example, you may have a different set of servers for the account/container rings than the object rings.)

    • The drive size is the expected size.

18.6.2.3 Interpreting Swift Input Model Validation Errors

The following examples provide an error message, description, and resolution.

Note

To resolve an error, you must first modify the input model and re-run the configuration processor. (For instructions, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”.) Then, continue with the deployment.

  1. Example Message - Model Mismatch: Cannot find drive /dev/sdt on padawan-ccp-c1-m2 (192.168.245.3)

    Description The disk model used for node padawan-ccp-c1-m2 has drive /dev/sdt listed in the devices list of a device-group where swift is the consumer. However, the /dev/sdt device does not exist on that node.
    Resolution

    If a drive or controller has failed on a node, the operating system does not see the drive and so the corresponding block device may not exist. Sometimes this is transitory and a reboot may resolve the problem. The problem may not be with /dev/sdt, but with another drive. For example, if /dev/sds has failed, when you boot the node, the drive that you expect to be called /dev/sdt is actually called /dev/sds.

    Alternatively, there may not be enough drives installed in the server. You can add drives. Another option is to remove /dev/sdt from the appropriate disk model. However, this removes the drive for all servers using the disk model.

  2. Example Message - Model Mismatch: Cannot find drive /dev/sdd2 on padawan-ccp-c1-m2 (192.168.245.3)

    Description The disk model used for node padawan-ccp-c1-m2 has drive /dev/sdd2 listed in the devices list of a device-group where swift is the consumer. However, the partition number (2) has been specified in the model. This is not supported: specify only the block device name (for example, /dev/sdd), not partition names, in disk models.
    Resolution Remove the partition number from the disk model.
  3. Example Message - Cannot find IP address of padawan-ccp-c1-m3-swift for ring: account host: padawan-ccp-c1-m3-mgmt

    Description The service (in this example, swift-account) is running on the node padawan-ccp-c1-m3. However, this node does not have a connection to the network designated for the swift-account service (that is, the SWIFT network).
    Resolution Check the input model for which networks are configured for each node type.
  4. Example Message - Ring: object-2 has specified replication_policy and erasure_coding_policy. Only one may be specified.

    Description Only one of replication-policy or erasure-coding-policy may be specified for a ring in ring-specifications.
    Resolution Remove one of the policy types.
  5. Example Message - Ring: object-3 is missing a policy type (replication-policy or erasure-coding-policy)

    Description There is no replication-policy or erasure-coding-policy section in ring-specifications for the object-3 ring.
    Resolution Add a policy type to the input model file.
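
The partition-number mistake in the second example can be caught before running the configuration processor. The following is a minimal sketch, not part of the product: the function name is_partition_name is hypothetical, and it assumes device names of the form /dev/sdX only (other naming schemes, such as NVMe, would need a different pattern).

```shell
# Hypothetical helper: succeed if the argument looks like a partition
# (for example /dev/sdd2) rather than a whole block device (/dev/sdd).
# Assumption: only /dev/sdX style names are checked.
is_partition_name() {
    case "$1" in
        /dev/sd[a-z]*[0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (device names are illustrative):
#   is_partition_name /dev/sdd2 && echo "remove the partition number"
```

A name that passes this check should be shortened to the block device name before it is used in a disk model.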

18.6.2.4 Identifying the Swift Ring Building Server

18.6.2.4.1 Identify the swift Ring Building server

Perform the following steps to identify the swift ring building server:

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the following command:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-status.yml --limit SWF-ACC[0]
  3. Examine the output of this playbook. The last line underneath the play recap will give you the server name which is your swift ring building server.

    PLAY RECAP ********************************************************************
    _SWF_CMN | status | Check systemd service running ----------------------- 1.61s
    _SWF_CMN | status | Check systemd service running ----------------------- 1.16s
    _SWF_CMN | status | Check systemd service running ----------------------- 1.09s
    _SWF_CMN | status | Check systemd service running ----------------------- 0.32s
    _SWF_CMN | status | Check systemd service running ----------------------- 0.31s
    _SWF_CMN | status | Check systemd service running ----------------------- 0.26s
    -------------------------------------------------------------------------------
    Total: ------------------------------------------------------------------ 7.88s
    ardana-cp1-c1-m1-mgmt      : ok=7    changed=0    unreachable=0    failed=0

    In the above example, the swift ring building server is ardana-cp1-c1-m1-mgmt.

Important

For the purposes of this document, any errors you see in the output of this playbook can be ignored if all you are looking for is the server name for your swift ring builder server.
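
If you want to capture the server name in a script, the host line at the end of the play recap can be extracted from the playbook output. This is a sketch under the assumption that the recap line has the form shown above (host name, colon, ok=N); the function name ring_builder_from_recap is hypothetical.

```shell
# Hypothetical helper: read ansible-playbook output on stdin and print the
# host name from the last "HOST : ok=N changed=N ..." recap line.
ring_builder_from_recap() {
    awk '/: *ok=[0-9]/ { host = $1 } END { if (host != "") print host }'
}

# Example use on the Cloud Lifecycle Manager:
#   ansible-playbook -i hosts/verb_hosts swift-status.yml --limit SWF-ACC[0] \
#     | ring_builder_from_recap
```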

18.6.2.5 Verifying a Swift Partition Label

Warning

For a system upgrade do NOT clear the label before starting the upgrade.

This topic describes how to check whether a device has a label on a partition.

18.6.2.5.1 Check Partition Label

To check whether a device has a label on a partition, perform the following step:

  • Log on to the node and use the parted command:

    sudo parted -l

    The output lists all of the block devices. The following is example output for /dev/sdc, which has a single partition with a label of c0a8f502h000. Because the partition has a label, if you are about to install and deploy the system, you must clear this label before starting the deployment. As part of the deployment process, the system will label the partition.

    .
    .
    .
    Model: QEMU QEMU HARDDISK (scsi)
    Disk /dev/sdc: 20.0GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags:
    
    Number  Start   End     Size    File system  Name           Flags
    1       1049kB  20.0GB  20.0GB  xfs          c0a8f502h000
    
    .
    .
    .

18.6.2.6 Verifying a Swift File System Label

Warning

For a system upgrade do NOT clear the label before starting the upgrade.

This topic describes how to check whether a file system in a partition has a label.

To check whether a file system in a partition has a label, perform the following step:

  • Log on to the server and execute the xfs_admin command (where /dev/sdc1 is the partition where the file system is located):

    sudo xfs_admin -l /dev/sdc1

    The output shows if a file system has a label. For example, this shows a label of c0a8f502h000:

    $ sudo xfs_admin -l /dev/sdc1
    label = "c0a8f502h000"

    If no file system exists, the result is as follows:

    $ sudo xfs_admin -l /dev/sde1
    xfs_admin: /dev/sde1 is not a valid XFS file system (unexpected SB magic number 0x00000000)

    If you are about to install and deploy the system, you must delete the label before starting the deployment. As part of the deployment process, the system will label the partition.
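
When checking many partitions, the label can be pulled out of the xfs_admin output with a small filter. This is a sketch that assumes the output format shown in the example above (label = "..."), and that an unlabeled file system prints an empty pair of quotes; both function names are hypothetical.

```shell
# Hypothetical helper: read `xfs_admin -l` output on stdin and print the
# label text without the surrounding quotes.
xfs_label_from_output() {
    sed -n 's/^label = "\(.*\)"$/\1/p'
}

# Hypothetical helper: succeed only if the output on stdin carries a
# non-empty label.
has_xfs_label() {
    [ -n "$(xfs_label_from_output)" ]
}

# Example use (the device name is illustrative):
#   sudo xfs_admin -l /dev/sdc1 | has_xfs_label && echo "label present"
```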

18.6.2.7 Recovering swift Builder Files

When you execute the deploy process for a system, a copy of the builder files is stored on the following nodes and directories:

  • On the swift ring building node, the primary reference copy is stored in the /etc/swiftlm/<cloud-name>/<control-plane-name>/builder_dir/ directory.

  • On the next node after the swift ring building node, a backup copy is stored in the /etc/swiftlm/<cloud-name>/<control-plane-name>/builder_dir/ directory.

  • In addition, in the deploy process, the builder files are also copied to the /etc/swiftlm/deploy_dir/<cloud-name> directory on every swift node.

If these builder files are found on the primary swift ring building node (to identify which node is the primary ring building node, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”) in the directory /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir, then no further recovery action is needed. If not, you need to copy the files from an intact swift node onto the primary swift ring building node.

If you have no intact /etc/swiftlm directory on any swift node, you may be able to restore from a backup. See Section 15.2.3.2, “Recovering the Control Plane”.

To restore builder files on the primary ring builder node from a backup stored on another member of the ring, use the following process:

  1. Log in to the swift ring building server (to identify it, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”).

  2. Create the /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir directory structure with these commands:

    Replace CLOUD_NAME with the name of your cloud and CONTROL_PLANE_NAME with the name of your control plane.

    tux > sudo mkdir -p /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
    tux > sudo chown -R ardana.ardana /etc/swiftlm/
  3. Log in to a swift node where an intact /etc/swiftlm/deploy_dir directory exists.

  4. Copy the builder files to the swift ring building node. In the example below we use scp to transfer the files, where swpac-ccp-c1-m1-mgmt is the node where the files can be found, cloud1 is the cloud, and cp1 is the control plane name:

    tux > sudo mkdir -p /etc/swiftlm/cloud1/cp1/builder_dir
    tux > cd /etc/swiftlm/cloud1/cp1/builder_dir
    tux > sudo scp -r ardana@swpac-ccp-c1-m1-mgmt:/etc/swiftlm/cloud1/cp1/builder_dir/* ./
    tux > sudo chown -R swift:swift /etc/swiftlm

    (Any permissions errors related to files in the backups directory can be ignored.)

  5. Skip this step if you are rebuilding the entire node. Perform it only if swift components are already present and functioning on the server and you are recovering or updating the ring builder files. From the Cloud Lifecycle Manager, run the swift-reconfigure.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml

18.6.2.8 Restarting the Object Storage Deployment

This page describes the various operational procedures performed by swift.

18.6.2.8.1 Restart the Swift Object Storage Deployment

The structure of a ring is built in incremental stages. When you modify a ring, the new ring uses the state of the old ring as its basis. Rings are stored in the builder file. The swiftlm-ring-supervisor stores builder files in the /etc/swiftlm/cloud1/cp1/builder_dir/ directory on the Ring-Builder node. The builder files are named <ring-name>.builder. Prior versions of the builder files are stored in the /etc/swiftlm/cloud1/cp1/builder_dir/backups directory.

Generally, you use an existing builder file as the basis for changes to a ring. However, at initial deployment, when you create a ring there will be no builder file. Instead, the first step in the process is to build a builder file. The deploy playbook does this as a part of the deployment process. If you have successfully deployed some of the system, the ring builder files will exist.

If you now change your input model (for example, by adding servers), the process assumes you are modifying a ring and behaves differently than when creating a ring from scratch. In this case, the ring is not balanced. So, if the cloud model contains an error or you decide to make substantive changes, it is a best practice to start from scratch and build rings using the steps below.

18.6.2.8.2 Reset Builder Files

Reset the builder files only during the initial deployment process, and only when you want to restart a deployment from scratch. If you reset the builder files after completing your initial deployment, you risk losing critical system data.

Delete the builder files in the /etc/swiftlm/cloud1/cp1/builder_dir/ directory. For example, for the region0 keystone region (the default single region designation), do the following:

sudo rm /etc/swiftlm/cloud1/cp1/builder_dir/*.builder
Note

If you have successfully deployed a system and accidentally delete the builder files, you can recover to the correct state. For instructions, see Section 18.6.2.7, “Recovering swift Builder Files”.

18.6.2.9 Increasing the Swift Node Timeout Value

On a heavily loaded Object Storage system, timeouts may occur when transferring data to or from swift, particularly with large objects.

The following is an example of a timeout message in the log (/var/log/swift/swift.log) on a swift proxy server:

Jan 21 16:55:08 ardana-cp1-swpaco-m1-mgmt proxy-server: ERROR with Object server 10.243.66.202:6000/disk1 re: Trying to write to /v1/AUTH_1234/testcontainer/largeobject: ChunkWriteTimeout (10s)

If this occurs, it may be necessary to increase the node_timeout parameter in the proxy-server.conf configuration file.

The node_timeout parameter in the swift proxy-server.conf file is the maximum amount of time the proxy server will wait for a response from the account, container, or object server. The default value is 10 seconds.
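
To judge whether timeouts are widespread or confined to one server, you can count the ChunkWriteTimeout messages per object server. This is a sketch only: it assumes each log entry is a single line in the format of the example above, with the server address following the word "server"; the function name timeouts_per_server is hypothetical.

```shell
# Hypothetical helper: read swift log lines on stdin and print, per object
# server, how many ChunkWriteTimeout errors were logged (highest count first).
timeouts_per_server() {
    awk '/ChunkWriteTimeout/ {
        # The server address is the field right after the word "server".
        for (i = 1; i < NF; i++)
            if ($i == "server") { count[$(i + 1)]++; break }
    }
    END { for (s in count) print count[s], s }' | sort -rn
}

# Example use on a swift proxy server:
#   sudo cat /var/log/swift/swift.log | timeouts_per_server
```

Many timeouts against a single server or disk may point to a local hardware or load problem rather than to a timeout value that is too low.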

In order to modify the timeout you can use these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/swift/proxy-server.conf.j2 file and add a line specifying the node_timeout into the [app:proxy-server] section of the file.

    Example, increasing the timeout to 30 seconds (the node_timeout line is the addition):

    [app:proxy-server]
    use = egg:swift#proxy
    .
    .
    node_timeout = 30
  3. Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  4. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Use the playbook below to create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Change to the deployment directory and run the swift reconfigure playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
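
To double-check the edit from step 2, or the rendered configuration file after the reconfigure, you can read back the node_timeout value from the [app:proxy-server] section. The following is a sketch; the function name proxy_node_timeout is hypothetical, and it assumes plain INI-style "key = value" lines with section headers at the start of a line.

```shell
# Hypothetical helper: read a proxy-server.conf style file on stdin and print
# the node_timeout value set inside the [app:proxy-server] section, if any.
proxy_node_timeout() {
    awk '
        /^\[/ { in_section = ($0 == "[app:proxy-server]"); next }
        in_section && /^node_timeout *=/ {
            sub(/^node_timeout *= */, "")
            print
            exit
        }'
}

# Example use on the Cloud Lifecycle Manager:
#   proxy_node_timeout < ~/openstack/my_cloud/config/swift/proxy-server.conf.j2
```

No output means that the section does not set node_timeout, so the default applies.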

18.6.2.10 Troubleshooting Swift File System Usage Issues

If you have recycled your environment to do a re-installation and you have not run the wipe_disks.yml playbook in the process, you may experience an issue where your file system usage continues to grow even though you are not adding any files to your swift system. This is likely occurring because the quarantined directory is filling up. You can find this directory at /srv/node/disk0/quarantined.

You can resolve this issue by following these steps:

  1. SSH to each of your swift nodes and stop the replication processes on each of them. The following commands must be executed on each of your swift nodes. Make note of the time that you performed this action as you will reference it in step three.

    sudo systemctl stop swift-account-replicator
    sudo systemctl stop swift-container-replicator
    sudo systemctl stop swift-object-replicator
  2. Examine the /var/log/swift/swift.log file for events that indicate when the auditor processes have started and completed audit cycles. For more details, see Section 18.6.2.10.1, “Examining the swift Log for Audit Event Cycles”.

  3. Wait until you see that the auditor processes have finished two complete cycles since the time you stopped the replication processes (from step one). You must check every swift node. On a lightly loaded, recently installed system, this should take less than two hours.

  4. At this point you should notice that your quarantined directory has stopped growing. You may now delete the files in that directory on each of your nodes.

  5. Restart the replication processes using the swift start playbook:

    1. Log in to the Cloud Lifecycle Manager.

    2. Run the swift start playbook:

      cd ~/scratch/ansible/next/ardana/ansible
      ansible-playbook -i hosts/verb_hosts swift-start.yml
18.6.2.10.1 Examining the swift Log for Audit Event Cycles

Below is an example of the object-auditor start and end cycle details. They were taken by using the following command on a swift node:

sudo grep object-auditor /var/log/swift/swift.log|grep ALL

Example output:

$ sudo grep object-auditor /var/log/swift/swift.log|grep ALL
...
Apr  1 13:31:18 padawan-ccp-c1-m1-mgmt object-auditor: Begin object audit "forever" mode (ALL)
Apr  1 13:31:18 padawan-ccp-c1-m1-mgmt object-auditor: Object audit (ALL). Since Fri Apr  1 13:31:18 2016: Locally: 0 passed, 0 quarantined, 0 errors files/sec: 0.00 , bytes/sec: 0.00, Total time: 0.00, Auditing time: 0.00, Rate: 0.00
Apr  1 13:51:32 padawan-ccp-c1-m1-mgmt object-auditor: Object audit (ALL) "forever" mode completed: 1213.78s. Total quarantined: 0, Total errors: 0, Total files/sec: 7.02, Total bytes/sec: 9999722.38, Auditing time: 1213.07, Rate: 1.00

In this example, the auditor started at 13:31 and ended at 13:51.

In this next example, the account-auditor and container-auditor use similar message structure, so we only show the container auditor. You can substitute account for container as well:

$ sudo grep container-auditor /var/log/swift/swift.log
...
Apr  1 14:07:00 padawan-ccp-c1-m1-mgmt container-auditor: Begin container audit pass.
Apr  1 14:07:00 padawan-ccp-c1-m1-mgmt container-auditor: Since Fri Apr  1 13:07:00 2016: Container audits: 42 passed audit, 0 failed audit
Apr  1 14:37:00 padawan-ccp-c1-m1-mgmt container-auditor: Container audit pass completed: 0.10s

In the example, the container auditor started a cycle at 14:07 and the cycle finished at 14:37.
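
Counting the object auditor cycles that both started and completed after you stopped the replicators can be scripted. The following is a sketch under simplifying assumptions: the log lines have the format shown above, all entries are from the same day (only the HH:MM:SS field is compared), and each cycle logs a "Begin object audit" line. The function name completed_object_audit_cycles is hypothetical.

```shell
# Hypothetical helper: read swift log lines on stdin and print how many object
# audit cycles began at or after the given time (HH:MM:SS) and then completed.
completed_object_audit_cycles() {
    awk -v since="$1" '
        /object-auditor: Begin object audit/ && $3 >= since { started = 1 }
        /object-auditor: Object audit \(ALL\) "forever" mode completed/ && started {
            n++
            started = 0
        }
        END { print n + 0 }'
}

# Example use, if the replicators were stopped at 13:00:00:
#   sudo cat /var/log/swift/swift.log | completed_object_audit_cycles 13:00:00
```

A result of 2 or more on every swift node corresponds to the "two complete cycles" condition in the procedure; the account and container auditors would need their own patterns.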

18.7 Monitoring, Logging, and Usage Reporting Troubleshooting

Troubleshooting scenarios with resolutions for the Monitoring, Logging, and Usage Reporting services.

18.7.1 Troubleshooting Centralized Logging

This section contains the following scenarios:

18.7.1.1 Reviewing Log Files

You can troubleshoot service-specific issues by reviewing the logs. After logging into Kibana, follow these steps to load the logs for viewing:

  1. Navigate to the Settings menu to configure an index pattern to search for.

  2. In the Index name or pattern field, you can enter logstash-* to query all Elasticsearch indices.

  3. Click the green Create button to create and load the index.

  4. Navigate to the Discover menu to load the index and make it available to search.

Note

If you want to search specific Elasticsearch indices, you can run the following command from the control plane to get a full list of available indices:

curl localhost:9200/_cat/indices?v

Once the logs load, you can change the time frame from the drop-down in the upper right-hand corner of the Kibana window. You have the following options to choose from:

  • Quick - a variety of time frame choices will be available here

  • Relative - allows you to select a start time relative to the current time to show this range

  • Absolute - allows you to select a date range to query

When searching there are common fields you will want to use, such as:

  • type - this will include the service name, such as keystone or ceilometer

  • host - you can specify a specific host to search for in the logs

  • file - you can specify a specific log file to search

For more details on using Kibana and Elasticsearch to query logs, see https://www.elastic.co/guide/en/kibana/3.0/working-with-queries-and-filters.html

18.7.1.2 Monitoring Centralized Logging

To help keep ahead of potential logging issues and resolve issues before they affect logging, you may want to monitor the Centralized Logging Alarms.

To monitor logging alarms:

  1. Log in to Operations Console.

  2. From the menu button in the upper left corner, navigate to the Alarm Definitions page.

  3. Find the alarm definitions that are applied to the various hosts. See Section 18.1.1, “Alarm Resolution Procedures” for the Centralized Logging Alarm Definitions.

  4. Navigate to the Alarms page.

  5. Find the alarm definitions applied to the various hosts. These should match the alarm definitions in Section 18.1.1, “Alarm Resolution Procedures”.

  6. See if the alarm is green (good) or is in a bad state. If any are in a bad state, see the possible actions to perform in Section 18.1.1, “Alarm Resolution Procedures”.

On the Alarms page, you can use filtering to look for the following:

  1. To look for processes that may be down, filter for "Process" and then make sure the following processes are up:

    • Elasticsearch

    • Logstash

    • Beaver

    • Apache (Kafka)

    • Kibana

    • monasca

  2. To look for sufficient disk space, filter for "Disk"

  3. To look for sufficient memory, filter for "Memory"

18.7.1.3 Situations In Which Logs Might Not Be Collected

Centralized logging might not collect log data under the following circumstances:

  • If the Beaver service is not running on one or more of the nodes (controller or compute), logs from these nodes will not be collected.

18.7.1.4 Error When Creating a Kibana Visualization

When creating a visualization in Kibana you may get an error similar to this:

"logstash-*" index pattern does not contain any of the following field types: number

To resolve this issue:

  1. Log in to Kibana.

  2. Navigate to the Settings page.

  3. In the left panel, select the logstash-* index.

  4. Click the Refresh button. You may see a mapping conflict warning after refreshing the index.

  5. Re-create the visualization.

18.7.1.5 After Deploying Logging-API, Logs Are Not Centrally Stored

If you are using the Logging-API and logs are not being centrally stored, use the following checklist to troubleshoot Logging-API.

  • Ensure monasca is running.

  • Check any alarms monasca has triggered.

  • Check to see if the Logging-API (monasca-log-api) process alarm has triggered.

  • Run an Ansible playbook to get the status of the Cloud Lifecycle Manager:

    ansible-playbook -i hosts/verb_hosts ardana-status.yml

  • Troubleshoot all specific tasks that have failed on the Cloud Lifecycle Manager.

  • Ensure that the Logging-API daemon is up.

  • Run an Ansible playbook to try to bring the Logging-API daemon up:

    ansible-playbook -i hosts/verb_hosts logging-start.yml

  • If you get errors trying to bring up the daemon, resolve them.

  • Verify that the Logging-API configuration settings are correct in the configuration file:

    roles/kronos-api/templates/kronos-apache2.conf.j2

The following is a sample Logging-API configuration file:

{#
# (c) Copyright 2015-2016 Hewlett Packard Enterprise Development LP
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
#}
Listen {{ kronos_api_host }}:{{ kronos_api_port }}
<VirtualHost *:{{ kronos_api_port }}>
    WSGIDaemonProcess log-api processes=4 threads=4 socket-timeout=300  user={{ kronos_user }} group={{ kronos_group }} python-path=/opt/stack/service/kronos/venv:/opt/stack/service/kronos/venv/bin/../lib/python2.7/site-packages/ display-name=monasca-log-api
    WSGIProcessGroup log-api
    WSGIApplicationGroup log-api
    WSGIScriptAlias / {{ kronos_wsgi_dir }}/app.wsgi
    ErrorLog /var/log/kronos/wsgi.log
    LogLevel info
    CustomLog /var/log/kronos/wsgi-access.log combined

    <Directory /opt/stack/service/kronos/venv/bin/../lib/python2.7/site-packages/monasca_log_api>
      Options Indexes FollowSymLinks MultiViews
      Require all granted
      AllowOverride None
      Order allow,deny
      allow from all
      LimitRequestBody 102400
    </Directory>

    SetEnv no-gzip 1
</VirtualHost>

18.7.1.6 Re-enabling Slow Logging

MariaDB slow logging was enabled by default in earlier versions. Slow logging logs slow MariaDB queries to /var/log/mysql/mysql-slow.log on FND-MDB hosts.

As it is possible for temporary tokens to be logged to the slow log, we have disabled slow log in this version for security reasons.

To re-enable slow logging, use the following procedure:

  1. Log in to the Cloud Lifecycle Manager and set the MariaDB service configuration option that enables slow logging.

    cd ~/openstack/my_cloud
    1. Check slow_query_log is currently disabled with a value of 0:

      grep slow ./config/percona/my.cfg.j2
      slow_query_log          = 0
      slow_query_log_file     = /var/log/mysql/mysql-slow.log
    2. Enable slow logging in the server configuration template file and confirm the new value:

      sed -e 's/^\(slow_query_log[[:space:]]*=[[:space:]]*\)0/\11/' -i ./config/percona/my.cfg.j2
      grep slow ./config/percona/my.cfg.j2
      slow_query_log          = 1
      slow_query_log_file     = /var/log/mysql/mysql-slow.log
    3. Commit the changes:

      git add -A
      git commit -m "Enable Slow Logging"
  2. Run the configuration processor.

    cd ~/openstack/ardana/ansible/
    ansible-playbook -i hosts/localhost config-processor-run.yml
  3. You will be prompted for an encryption key and asked whether you want to change the encryption key to a new value, which must be different from the current key. To turn off encryption, run the following instead:

    ansible-playbook -i hosts/localhost config-processor-run.yml -e encrypt="" -e rekey=""
  4. Create a deployment directory.

    ansible-playbook -i hosts/localhost ready-deployment.yml
  5. Reconfigure Percona from the deployment directory (note that this restarts the mysqld server on your cluster hosts).

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts percona-reconfigure.yml
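
Because the template aligns its values with a variable number of spaces, a substitution that tolerates any amount of whitespace around the equals sign is the more robust way to flip the setting. The following sketch demonstrates this on text read from stdin; the function name enable_slow_query_log is hypothetical, and the sample lines are the ones from the grep output in step 1.

```shell
# Hypothetical helper: read my.cfg style lines on stdin and switch
# "slow_query_log = 0" to 1, whatever the spacing around "=".
# The slow_query_log_file line is left untouched.
enable_slow_query_log() {
    sed -e 's/^\(slow_query_log[[:space:]]*=[[:space:]]*\)0[[:space:]]*$/\11/'
}

# Example use (prints the modified text without changing the file):
#   enable_slow_query_log < ~/openstack/my_cloud/config/percona/my.cfg.j2 | grep slow
```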

18.7.2 Usage Reporting Troubleshooting

Troubleshooting scenarios with resolutions for the ceilometer service.

18.7.2.1 Logging

Logs for the various running components in the Overcloud Controllers can be found at /var/log/ceilometer.log

The Upstart for the services also logs data at /var/log/upstart

18.7.2.2 Modifying

Change the level of debugging in ceilometer by editing the ceilometer.conf file located at /etc/ceilometer/ceilometer.conf. To log the maximum amount of information, change the level entry to DEBUG.

Note: When the logging level for a service is changed, that service must be restarted before the change takes effect.

This is an excerpt of the ceilometer.conf configuration file showing where to make changes:

[loggers]
 keys: root

[handlers]
 keys: watchedfile, logstash

[formatters]
 keys: context, logstash

[logger_root]
 qualname: root
 handlers: watchedfile, logstash
 level: NOTSET
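
As an illustration of the change described above, the following sketch rewrites the level entry of the [logger_root] section in text read from stdin. It assumes the "key: value" layout of the excerpt; the function name set_root_log_level is hypothetical, and the sketch prints the result rather than editing the file in place.

```shell
# Hypothetical helper: read a logging configuration on stdin and print it with
# the "level" entry of the [logger_root] section set to the given value.
set_root_log_level() {
    awk -v level="$1" '
        /^\[/ { in_root = ($0 == "[logger_root]") }
        in_root && /^ *level *:/ { sub(/:.*/, ": " level) }
        { print }'
}

# Example use (review the output before replacing the real file):
#   set_root_log_level DEBUG < /etc/ceilometer/ceilometer.conf
```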

18.7.2.3 Messaging/Queuing Errors

ceilometer relies on a message bus for passing data between the various components. In high-availability scenarios, RabbitMQ servers are used for this purpose. If these servers are not available, the ceilometer log will record errors during "Connecting to AMQP" attempts.

These errors may indicate that the RabbitMQ messaging nodes are not running as expected and/or the RPC publishing pipeline is stale. When these errors occur, re-start the instances.

Example error:

Error: unable to connect to node 'rabbit@xxxx-rabbitmq0000': nodedown

Restart the RabbitMQ server on the affected node, and then start the RabbitMQ application:

  1. Restart the downed cluster node:

    sudo invoke-rc.d rabbitmq-server start
  2. Start the RabbitMQ application on that node:

    sudo rabbitmqctl start_app

18.8 Orchestration Troubleshooting

Troubleshooting scenarios with resolutions for the Orchestration services.

18.8.1 Heat Troubleshooting

Troubleshooting scenarios with resolutions for the heat service.

18.8.1.1 RPC timeout on Heat Stack Creation

If you experience a remote procedure call (RPC) timeout failure when attempting heat stack-create, you can work around the issue by increasing the timeout value and purging records of deleted stacks from the database. To do so, follow the steps below. An example of the error is:

MessagingTimeout: resources.XXX-LCP-Pair01.resources[0]: Timed out waiting for a reply to message ID e861c4e0d9d74f2ea77d3ec1984c5cb6
  1. Increase the timeout value.

    ardana > cd ~/openstack/my_cloud/config/heat
  2. Make changes to heat config files. In heat.conf.j2, add this timeout value:

    rpc_response_timeout=300

    Commit your changes:

    git commit -a -m "some message"
  3. Move to the ansible directory and run the following playbooks:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Change to the scratch directory and run heat-reconfigure:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml
  5. Purge records of deleted stacks from the database. First delete all stacks that are in a failed state. Then execute the following command:

    sudo /opt/stack/venv/heat-20151116T000451Z/bin/python2 \
      /opt/stack/service/heat-engine/venv/bin/heat-manage \
      --config-file /opt/stack/service/heat-engine-20151116T000451Z/etc/heat/heat.conf \
      --config-file /opt/stack/service/heat-engine-20151116T000451Z/etc/heat/engine.conf purge_deleted 0

18.8.1.2 General Heat Stack Creation Errors

Generally in heat, a timeout means that the underlying resource service, such as nova, neutron, or cinder, failed to complete the required action. Whatever error the underlying service reports, heat simply reports it back. So, if heat stack creation times out, look at the logs of the underlying services, most importantly the nova service, to understand the reason for the timeout.

18.8.1.3 Multiple Heat Stack Create Failure

The monasca AlarmDefinition resource, OS::monasca::AlarmDefinition, which is used for heat autoscaling, has an optional property, name, for defining the alarm name. If this optional property is specified in the heat template, the name must be unique within the project. Otherwise, creating multiple heat stacks from this template fails with the following conflict:

| cpu_alarm_low  | 5fe0151b-5c6a-4a54-bd64-67405336a740 | HTTPConflict: resources.cpu_alarm_low: An alarm definition already exists for project / tenant: 835d6aeeb36249b88903b25ed3d2e55a named: CPU utilization less than 15 percent  | CREATE_FAILED  | 2016-07-29T10:28:47 |

This is because monasca registers the alarm definition under the name property when it is defined in the heat template. This name must be unique.

To avoid this problem, if you want to define an alarm name using this property in the template, make sure the name is unique within the project. Otherwise, leave this optional property undefined in your template. In this case, the system creates a unique alarm name automatically during heat stack create.

18.8.1.4 Unable to Retrieve QOS Policies

Launching the Orchestration Template Generator may trigger the message: Unable to retrieve resources Qos Policies. This is a known upstream bug. This information message can be ignored.

18.8.2 Troubleshooting Magnum Service

Troubleshooting scenarios with resolutions for the Magnum service. The Magnum service makes container orchestration engines such as Docker Swarm, Kubernetes, and Apache Mesos available as first-class resources. You can use this guide to help with known issues and troubleshooting of Magnum services.

18.8.2.1 Magnum cluster fails to create

Typically, small clusters need about 3-5 minutes to stand up. If cluster stand-up takes longer, you can start troubleshooting right away rather than waiting for the status to change to CREATE_FAILED after the timeout.

  1. Use heat resource-list STACK-ID to identify which heat stack resource is stuck in CREATE_IN_PROGRESS.

    Note

    The main heat stack has nested stacks, one for kubemaster(s) and one for kubeminion(s). These stacks are visible as resources of type OS::heat::ResourceGroup (in parent stack) and file:///... in nested stack. If any resource remains in CREATE_IN_PROGRESS state within the nested stack, the overall state of the resource will be CREATE_IN_PROGRESS.

    $ heat resource-list -n2 22385a42-9e15-49d9-a382-f28acef36810
    +-------------------------------+--------------------------------------+--------------------------------------+--------------------+----------------------+------------------------------------------------------------------+
    | resource_name                 | physical_resource_id                 | resource_type                        | resource_status    | updated_time         | stack_name                                                       |
    +-------------------------------+--------------------------------------+--------------------------------------+--------------------+----------------------+------------------------------------------------------------------+
    | api_address_floating_switch   | 06b2cc0d-77f9-4633-8d96-f51e2db1faf3 | Magnum::FloatingIPAddressSwitcher    | CREATE_COMPLETE    | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv                                          |
    . . .
    
    | fixed_subnet                  | d782bdf2-1324-49db-83a8-6a3e04f48bb9 | OS::neutron::Subnet                  | CREATE_COMPLETE    | 2017-04-10T21:25:11Z | my-cluster-z4aquda2mgpv                                          |
    | kube_masters                  | f0d000aa-d7b1-441a-a32b-17125552d3e0 | OS::heat::ResourceGroup              | CREATE_IN_PROGRESS | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv                                          |
    | 0                             | b1ff8e2c-23dc-490e-ac7e-14e9f419cfb6 | file:///opt/s...ates/kubemaster.yaml | CREATE_IN_PROGRESS | 2017-04-10T21:25:41Z | my-cluster-z4aquda2mgpv-kube_masters-utyggcbucbhb                |
    | kube_master                   | 4d96510e-c202-4c62-8157-c0e3dddff6d5 | OS::nova::Server                     | CREATE_IN_PROGRESS | 2017-04-10T21:25:48Z | my-cluster-z4aquda2mgpv-kube_masters-utyggcbucbhb-0-saafd5k7l7im |
    . . .
  2. If stack creation failed on a native OpenStack resource, such as OS::nova::Server or OS::neutron::Router, proceed with troubleshooting the respective service. This type of error usually does not cause a timeout, and the cluster changes to status CREATE_FAILED quickly. The underlying reason for the failure, as reported by heat, can be checked with the magnum cluster-show command.

  3. If stack creation stopped on a resource of type OS::heat::WaitCondition, heat is not receiving a notification from the cluster VM that its bootstrap sequence has completed. Locate the corresponding resource of type OS::nova::Server and use its physical_resource_id to get information about the VM (which should be in status CREATE_COMPLETE):

    $ openstack server show 4d96510e-c202-4c62-8157-c0e3dddff6d5
    +--------------------------------------+---------------------------------------------------------------------------------------------------------------+
    | Property                             | Value                                                                                                         |
    +--------------------------------------+---------------------------------------------------------------------------------------------------------------+
    | OS-DCF:diskConfig                    | MANUAL                                                                                                        |
    | OS-EXT-AZ:availability_zone          | nova                                                                                                          |
    | OS-EXT-SRV-ATTR:host                 | comp1                                                                                                         |
    | OS-EXT-SRV-ATTR:hypervisor_hostname  | comp1                                                                                                         |
    | OS-EXT-SRV-ATTR:instance_name        | instance-00000025                                                                                             |
    | OS-EXT-STS:power_state               | 1                                                                                                             |
    | OS-EXT-STS:task_state                | -                                                                                                             |
    | OS-EXT-STS:vm_state                  | active                                                                                                        |
    | OS-SRV-USG:launched_at               | 2017-04-10T22:10:40.000000                                                                                    |
    | OS-SRV-USG:terminated_at             | -                                                                                                             |
    | accessIPv4                           |                                                                                                               |
    | accessIPv6                           |                                                                                                               |
    | config_drive                         |                                                                                                               |
    | created                              | 2017-04-10T22:09:53Z                                                                                          |
    | flavor                               | m1.small (2)                                                                                                  |
    | hostId                               | eb101a0293a9c4c3a2d79cee4297ab6969e0f4ddd105f4d207df67d2                                                      |
    | id                                   | 4d96510e-c202-4c62-8157-c0e3dddff6d5                                                                          |
    | image                                | fedora-atomic-26-20170723.0.x86_64 (4277115a-f254-46c0-9fb0-fffc45d2fd38)                                     |
    | key_name                             | testkey                                                                                                       |
    | metadata                             | {}                                                                                                            |
    | name                                 | my-zaqshggwge-0-sqhpyez4dig7-kube_master-wc4vv7ta42r6                                                         |
    | os-extended-volumes:volumes_attached | [{"id": "24012ce2-43dd-42b7-818f-12967cb4eb81"}]                                                              |
    | private network                      | 10.0.0.14, 172.31.0.6                                                                                         |
    | progress                             | 0                                                                                                             |
    | security_groups                      | my-cluster-z7ttt2jvmyqf-secgroup_base-gzcpzsiqkhxx, my-cluster-z7ttt2jvmyqf-secgroup_kube_master-27mzhmkjiv5v |
    | status                               | ACTIVE                                                                                                        |
    | tenant_id                            | 2f5b83ab49d54aaea4b39f5082301d09                                                                              |
    | updated                              | 2017-04-10T22:10:40Z                                                                                          |
    | user_id                              | 7eba6d32db154d4790e1d3877f6056fb                                                                              |
    +--------------------------------------+---------------------------------------------------------------------------------------------------------------+
  4. Use the floating IP of the master VM to log in to the first master node. Use the appropriate username below for your VM type. Passwords should not be required because the VMs should have the public SSH key installed.

    VM Type                               Username
    Kubernetes or Swarm on Fedora Atomic  fedora
    Kubernetes on CoreOS                  core
    Mesos on Ubuntu                       ubuntu
  5. Useful diagnostic commands:

    • Kubernetes cluster on Fedora Atomic

      sudo journalctl --system
      sudo journalctl -u cloud-init.service
      sudo journalctl -u etcd.service
      sudo journalctl -u docker.service
      sudo journalctl -u kube-apiserver.service
      sudo journalctl -u kubelet.service
      sudo journalctl -u wc-notify.service
    • Kubernetes cluster on CoreOS

      sudo journalctl --system
      sudo journalctl -u oem-cloudinit.service
      sudo journalctl -u etcd2.service
      sudo journalctl -u containerd.service
      sudo journalctl -u flanneld.service
      sudo journalctl -u docker.service
      sudo journalctl -u kubelet.service
      sudo journalctl -u wc-notify.service
    • Swarm cluster on Fedora Atomic

      sudo journalctl --system
      sudo journalctl -u cloud-init.service
      sudo journalctl -u docker.service
      sudo journalctl -u swarm-manager.service
      sudo journalctl -u wc-notify.service
    • Mesos cluster on Ubuntu

      sudo less /var/log/syslog
      sudo less /var/log/cloud-init.log
      sudo less /var/log/cloud-init-output.log
      sudo less /var/log/os-collect-config.log
      sudo less /var/log/marathon.log
      sudo less /var/log/mesos-master.log
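
For clusters with many resources, the table from step 1 can be long. The following is a minimal sketch, not part of the product, of a shell helper that filters that table down to the rows still in CREATE_IN_PROGRESS. The function name stuck_resources is hypothetical, and the sketch assumes the column order shown in the sample output above (resource_name, physical_resource_id, resource_type, resource_status).

```shell
# Hypothetical helper: reads the table printed by "heat resource-list"
# on standard input and prints name, type, and physical ID of every
# resource whose status is CREATE_IN_PROGRESS.
# Assumes the column order shown in the sample output of step 1.
stuck_resources() {
  awk -F'|' '
    NF >= 6 {
      # Strip the padding around the first four data columns.
      for (i = 2; i <= 5; i++) gsub(/^ +| +$/, "", $i)
      if ($5 == "CREATE_IN_PROGRESS") print $2, $4, $3
    }'
}

# Usage (STACK-ID is the ID of the cluster heat stack):
#   heat resource-list -n2 STACK-ID | stuck_resources
```

The deepest resource in the output (for example, an OS::nova::Server or OS::heat::WaitCondition inside a nested stack) is usually the one to investigate first, because its state propagates up to the parent resources.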

18.9 Troubleshooting Tools

Tools to assist with troubleshooting issues in your cloud. Additional troubleshooting information is available at Section 18.1, “General Troubleshooting”.

18.9.1 Retrieving the SOS Report

The SOS report provides debug-level information about your environment to assist in troubleshooting issues. When troubleshooting and debugging issues in your SUSE OpenStack Cloud environment, you can run an ansible playbook that generates a full debug report, referred to as an SOS report. You can send these reports to the support team when seeking assistance.

18.9.1.1 Retrieving the SOS Report

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the SOS report ansible playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts sosreport-run.yml
  3. Retrieve the SOS report tarballs, which will be in the following directories on your Cloud Lifecycle Manager:

    /tmp
    /tmp/sosreport-report-archives/
  4. You can then use these reports to troubleshoot issues further, or provide them to the support team when you contact them.
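
To locate the tarballs from step 3, a small shell helper such as the following sketch can be used. The function name list_sos_tarballs is hypothetical, and the file name pattern sosreport*.tar* is an assumption about how the tarballs are named. Adjust it if the files on your Cloud Lifecycle Manager are named differently.

```shell
# Hypothetical helper: lists SOS report tarballs in the given directories.
# Without arguments it searches the two directories named in step 3.
# The "sosreport*.tar*" file name pattern is an assumption.
list_sos_tarballs() {
  [ "$#" -gt 0 ] || set -- /tmp /tmp/sosreport-report-archives
  for dir in "$@"; do
    # Skip directories that do not exist on this host.
    [ -d "$dir" ] || continue
    find "$dir" -maxdepth 1 -type f -name 'sosreport*.tar*'
  done
}

# Usage:
#   list_sos_tarballs
```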

Warning

The SOS report may contain sensitive information because it includes service configuration file data. Remove any sensitive information before sending the SOS report tarball externally.
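
As a starting point for that review, the following sketch lists files in an extracted report that contain lines resembling credentials. The function name find_sensitive_files and the keyword list are illustrative assumptions, not a complete definition of sensitive data. A clean result does not prove that the report is safe to share, so still review the contents manually.

```shell
# Hypothetical helper: prints the names of files under the directory $1
# that contain lines looking like "password = ...", "secret: ...", and so on.
# The keyword list is an illustrative assumption and is not exhaustive.
# Only file names are printed, so matching values do not appear on screen.
find_sensitive_files() {
  grep -rilE '(password|secret|token)[a-z_]*[[:space:]]*[=:]' "$1"
}

# Usage (EXTRACTED-DIR is a directory where a report tarball was unpacked):
#   find_sensitive_files EXTRACTED-DIR
```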