AWS Cloud Operations & Migrations Blog

Achieving operational excellence by integrating AWS Health into change process

Operations teams create and use procedures to respond to operational events and need to ensure their effectiveness to support business needs. Everything continues to change—your business context, business priorities, and customer needs. It’s important to design operations to support changes over time in response to business iteration, and to incorporate lessons learned to minimize failures and reduce business impacts.

In the context of production change, whenever a failure occurs during the change window, your operations team has to find the root cause and revert the environment to the last functioning version. The root cause investigation needs to determine whether the failure was caused by the application (e.g. application defects or the change process itself) or by an underlying AWS service event.

To avoid impacts from existing AWS service events, the AWS Health Integrated Operational Change Process follows the Well-Architected practice of proactively checking the health status of the AWS services involved before executing operational changes. In this blog, we will show you how to create this change process using AWS Health and AWS Systems Manager.

Drawbacks of deploying changes without visibility of your AWS services

There are several disadvantages when changes are deployed to the workload without visibility into the health status of the AWS services involved:

  • Lacking a proactive capability to avoid impacts from AWS service events
    • When the workload is inoperable after the deployment, the operations team usually investigates the failed deployment retrospectively, and AWS service events can turn out to be the root cause. In that case the team is fixing the issue after it has already happened, rather than performing “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated before the changes are applied to production.
    • You want visibility when AWS service events happen so that you can avoid impacts to your change process, e.g. by postponing non-critical production changes.
  • Manual process for AWS service event awareness
    • You manually keep checking the AWS Health Dashboard in the AWS Console to monitor the status of the most recent AWS service events.
    • Your team reactively opens an AWS Support case to find out whether a change failed due to a related AWS service event, and relies on AWS Support for the latest information if a service event is the root cause.
  • False positive suspicions on application defects
    • Investigating potential application defects that could have led to a production change failure normally requires a collective effort from your team.
    • You want to avoid those false positive suspicions of application defects, so your teams can focus on strategic activities and outcomes.

As a result, an AWS Health Integrated Operational Change Process enables you to avoid impacts from existing AWS service events, and so enhances the resiliency of your overall operational process.

Well-Architected Operational Excellence

One of the design principles in the Operational Excellence pillar of the AWS Well-Architected Framework is to “Anticipate failure”, which encourages you to perform “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated.

Figure 1: Well-Architected Operational Excellence Design Principle – Anticipate Failure

To integrate the “Anticipate Failure” mindset into your operational change process, you need to take various types of failure into consideration, not just application failures. These include failures in the underlying cloud infrastructure (such as a Region or Availability Zone), failures at the cloud service level that can impact new service API calls during the change process, and other failures reported by the cloud service provider.

An effective approach to equip your operational change process with the “Anticipate Failure” mindset is to integrate AWS Health into it, which involves checking for any AWS Health issues before starting changes in your production environment. This is especially important for critical production changes, where you want to avoid any impact from existing cloud service events. By integrating AWS Health into your change process, you can streamline how AWS service status updates are obtained and feed them into your production change pipeline, in line with the Operational Excellence design principles.

Because an existing AWS service event will not necessarily impact a specific production change, you should work out a change strategy that aligns with your business goals and the criticality of your workload. More specifically:

  • For critical operational changes, consider suspending the production change whenever the change pipeline detects any active AWS service event. This is a recommended plan to control and remediate potential risks.
  • For other, non-critical operational changes, you might have the change pipeline analyze whether the existing AWS service event relates to the services used by your workload, so that an informed Go or No-Go decision can be made either automatically by the pipeline or by a person through an approval process.
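The two strategies above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Decision` type, the `decide` function, and the shape of each event (a mapping with a `service` key) are hypothetical names chosen for this example, not part of any AWS API.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class Decision:
    """Outcome of the Go or No-Go check (hypothetical type for this sketch)."""
    proceed: bool
    reason: str


def decide(critical: bool,
           active_events: Iterable[Mapping[str, str]],
           workload_services: Iterable[str]) -> Decision:
    """Apply the strategy described above to a list of active service events."""
    events = list(active_events)
    if not events:
        return Decision(True, "no active AWS service events")
    if critical:
        # Critical change: any active event suspends the change.
        return Decision(False, f"{len(events)} active event(s); critical change suspended")
    # Non-critical change: only events on services the workload uses count.
    used = {name.upper() for name in workload_services}
    related = [e for e in events if str(e.get("service", "")).upper() in used]
    if related:
        affected = sorted({e["service"] for e in related})
        return Decision(False, "active events affect: " + ", ".join(affected))
    return Decision(True, "active events do not involve workload services")
```

For a non-critical change, a No-Go result could equally be routed to a human approver instead of stopping the pipeline; that choice depends on your own change policy.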

AWS service health status plays an important role in the above change process, and it can be retrieved through the AWS Health API. The API allows you to programmatically receive information about AWS service degradations, resource maintenance events, and the AWS accounts and resources impacted by service events across AWS Regions.
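As a rough illustration of that programmatic access, the sketch below counts open service issues per service and Region using the boto3 `describe_events` paginator. The helper names (`build_event_filter`, `count_open_issues`, `make_health_client`) are hypothetical, and restricting the filter to open events in the `issue` category is an assumption about what counts as an active service event for your pipeline. Note that the AWS Health API is only available to accounts on a qualifying AWS Support plan.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional


def build_event_filter(regions: Iterable[str],
                       services: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Filter for open service issues in the given Regions (assumed policy)."""
    event_filter: Dict[str, Any] = {
        "regions": list(regions),
        "eventStatusCodes": ["open"],
        "eventTypeCategories": ["issue"],
    }
    if services:
        event_filter["services"] = list(services)
    return event_filter


def count_open_issues(health_client: Any,
                      regions: Iterable[str],
                      services: Optional[Iterable[str]] = None) -> Counter:
    """Count open issues keyed by (service, region), following pagination."""
    paginator = health_client.get_paginator("describe_events")
    counts: Counter = Counter()
    for page in paginator.paginate(filter=build_event_filter(regions, services)):
        for event in page.get("events", []):
            counts[(event.get("service"), event.get("region"))] += 1
    return counts


def make_health_client() -> Any:
    """Create a Health client; boto3 is imported lazily so the helpers above
    can be exercised without AWS credentials."""
    import boto3
    return boto3.client("health", region_name="us-east-1")
```

A call such as `count_open_issues(make_health_client(), ["us-east-1"], ["EC2"])` would return an empty counter when nothing is open for that service in that Region.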

With regard to the operational change process, there are various tools you could use to define an automated change process. In AWS, you can leverage AWS Systems Manager Change Manager, which executes Automation runbooks that walk through a predefined workflow to meet the business goal. For example, you can use an AWS-managed runbook to update hundreds or thousands of EC2 instances with a few clicks in the AWS console. You can also define approval steps before the production process starts, so the relevant stakeholders can endorse the change request before it is launched.

With the AWS Systems Manager Change Manager, Automation runbooks, and AWS Health API, you are able to create an AWS Health Integrated Operational Change Process, which can address the challenges highlighted earlier by:

  • Enabling operations teams to integrate AWS Health capabilities into their change process, so that they can proactively respond to active AWS service events (e.g. stop the change pipeline when there is an active service event).
  • Streamlining the AWS service status update procedure programmatically through the AWS Health API, so your operations team can define a response workflow based on the active AWS service event information (e.g. suspend the change pipeline and send notifications to relevant stakeholders if service events are impacting the services in the change process).
  • Avoiding false positive suspicions of application defects by checking the AWS Health API before the change process starts, which helps your teams stay focused on value-added tasks without unnecessary interruptions when an active AWS service event can impact the change pipeline.

Once your team has integrated the AWS Health capabilities into the AWS Systems Manager Change Manager template, you can reuse this pattern across the various change processes defined by AWS Systems Manager Automation runbooks. It’s an efficient way to create reusable change processes at organizational scale.

In the following sections, we will walk you through how to embed AWS Health API insights into your operational change process to automatically suspend operational changes whenever an AWS Health event is reported in a Region that you’re operating in.

To gain hands-on experience of building an AWS Health Integrated Operational Change Process, you can refer to this AWS Well-Architected lab: Build AWS Health Aware Operation Change Process. You can also find the detailed CloudFormation stack of this lab in the AWS Lab GitHub repository.

Prerequisites

The following prerequisites are crucial:

Architecture Overview

The solution leverages the following AWS Services and features in your AWS Account:

  1. AWS Systems Manager Change Manager, which is an enterprise change management framework for requesting, approving, implementing, and reporting on operational changes to your application configuration and infrastructure.
  2. AWS Systems Manager Automation runbooks, which help you build automated solutions to deploy, configure, and manage AWS resources at scale.
  3. AWS Health, which provides ongoing visibility into the availability of your AWS services. You can retrieve that availability data programmatically via the AWS Health API.

Figure 2: Solution architecture

In this example, the solution works as follows:

  1. Operations Administrators define a change template in AWS Systems Manager Change Manager. In a nutshell, a change template is a collection of configuration settings in Change Manager that define the approval process, available runbooks, and notification options for change requests.
  2. A custom runbook is defined to check the AWS Health status and follow the Go or No-Go operational path based on the result of that check. The runbook logic has the following steps:

Figure 3: Change Manager Detailed Automation Process

2.1. Poll AWS Health API – Change Manager executes the Automation runbook, whose ‘aws:executeScript’ step runs a script that calls the AWS Health DescribeEvents API to retrieve the list of active health events. The script then analyzes the events and decides whether they may impact the running deployment.

Note: If you are using AWS Systems Manager Change Manager with AWS Organizations, consider calling the AWS Health DescribeEventsForOrganization API to see whether there are active AWS service events that can impact any account within your organization.
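A minimal sketch of what such a script might look like follows. `script_handler(events, context)` is the handler shape that ‘aws:executeScript’ invokes for Python runtimes; everything else is an assumption made for illustration: the input keys `Regions` and `UseOrganizationView`, the output keys `ActiveEventCount` and `EventArns`, and the extra `health_client` argument, which exists only so the function can be tried locally with a stub.

```python
from typing import Any, Dict, List, Optional


def script_handler(events: Dict[str, Any], context: Any,
                   health_client: Optional[Any] = None) -> Dict[str, Any]:
    """Return the number of open AWS Health issues for the requested Regions.

    'Regions' and 'UseOrganizationView' are hypothetical input names chosen
    for this sketch; they would be supplied through the step's InputPayload.
    """
    if health_client is None:
        # boto3 is available inside the Automation script environment.
        import boto3
        health_client = boto3.client("health", region_name="us-east-1")

    event_filter = {
        "regions": list(events["Regions"]),
        "eventStatusCodes": ["open"],
        "eventTypeCategories": ["issue"],
    }
    # Organization-wide view, as suggested in the note above.
    if events.get("UseOrganizationView"):
        describe = health_client.describe_events_for_organization
    else:
        describe = health_client.describe_events

    arns: List[str] = []
    kwargs: Dict[str, Any] = {"filter": event_filter}
    while True:
        page = describe(**kwargs)
        arns.extend(event["arn"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    return {"ActiveEventCount": len(arns), "EventArns": arns}
```

Returning a plain count keeps the following branch step simple, since it only needs to compare one number against zero.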

2.2. The Automation runbook executes an “aws:branch” step to choose the path:

2.2a. If there are active AWS service events, cancel the workflow and send an SNS notification to admins.

2.2b. If there are no active service events, proceed with the operation, e.g. use the AWS public document “AWS-InstallApplication” to automatically install applications on EC2 instances.
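Steps 2.1 to 2.2b could be laid out in a runbook along the following lines. The sketch builds the Automation document as a Python dictionary so it can be serialized to JSON. The step names, parameter names, notification message, and runtime version are all assumptions for illustration, and the `source` parameter passed to “AWS-InstallApplication” should be checked against the document in your own account. In this layout the No-Go path simply ends the automation after notifying admins; how you mark the run as cancelled is left open.

```python
import json
from typing import Any, Dict


def build_runbook(script_source: str) -> Dict[str, Any]:
    """Assemble a hypothetical Automation document (schema 0.3)."""
    return {
        "schemaVersion": "0.3",
        "description": "Check AWS Health, then apply the change or notify admins.",
        "parameters": {
            "Regions": {"type": "StringList"},
            "InstanceIds": {"type": "StringList"},
            "SnsTopicArn": {"type": "String"},
            "InstallerSource": {"type": "String"},
        },
        "mainSteps": [
            {   # 2.1: poll the AWS Health API
                "name": "PollAwsHealth",
                "action": "aws:executeScript",
                "inputs": {
                    "Runtime": "python3.11",  # assumed; use a runtime your Region supports
                    "Handler": "script_handler",
                    "Script": script_source,
                    "InputPayload": {"Regions": "{{ Regions }}"},
                },
                "outputs": [{
                    "Name": "ActiveEventCount",
                    "Selector": "$.Payload.ActiveEventCount",
                    "Type": "Integer",
                }],
            },
            {   # 2.2: Go when no events are active, otherwise No-Go
                "name": "GoOrNoGo",
                "action": "aws:branch",
                "inputs": {
                    "Choices": [{
                        "NextStep": "ApplyChange",
                        "Variable": "{{ PollAwsHealth.ActiveEventCount }}",
                        "NumericEquals": 0,
                    }],
                    "Default": "NotifyAdmins",
                },
            },
            {   # 2.2a: notify admins and stop
                "name": "NotifyAdmins",
                "action": "aws:executeAwsApi",
                "isEnd": True,
                "inputs": {
                    "Service": "sns",
                    "Api": "Publish",
                    "TopicArn": "{{ SnsTopicArn }}",
                    "Message": "Change suspended: active AWS Health events detected.",
                },
            },
            {   # 2.2b: proceed with the operation
                "name": "ApplyChange",
                "action": "aws:runCommand",
                "isEnd": True,
                "inputs": {
                    "DocumentName": "AWS-InstallApplication",
                    "InstanceIds": "{{ InstanceIds }}",
                    # Parameter name assumed; verify against the document.
                    "Parameters": {"source": "{{ InstallerSource }}"},
                },
            },
        ],
    }


def runbook_json(script_source: str) -> str:
    """Serialize the document for upload as an Automation runbook."""
    return json.dumps(build_runbook(script_source), indent=2)
```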

3. After the change template and the authorized runbook with the above logic are approved, the operations team can rely on the following change process to apply the change to the production environment:

Figure 4: Solution workflow

3.1 The Operations Engineer creates a change request based on the approved template.

3.2 The Approver receives a notification, based on the SNS topic configured in the template. The approver can then review and approve the change request.

3.3 The change request is scheduled and the runbook is executed within the change window.

3.4 Both the Operations Engineer and the Approver are notified after the change request is completed (3.4a) or, if there are ongoing AWS Health events, suspended (3.4b).
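Step 3.1 can also be done programmatically. The sketch below prepares the arguments for the Systems Manager `start_change_request_execution` call in boto3; the helper names and every value passed to them (template name, runbook name, parameters) are placeholders you would replace with your own.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Mapping


def build_change_request(template_name: str,
                         runbook_name: str,
                         runbook_parameters: Mapping[str, Iterable[str]],
                         scheduled_time: datetime,
                         request_name: str) -> Dict[str, Any]:
    """Build keyword arguments for ssm.start_change_request_execution."""
    if scheduled_time.tzinfo is None:
        # Avoid ambiguity about when the change window starts.
        raise ValueError("scheduled_time must be timezone-aware")
    return {
        "ChangeRequestName": request_name,
        "DocumentName": template_name,      # the approved change template
        "ScheduledTime": scheduled_time,    # start of the change window
        "Runbooks": [{
            "DocumentName": runbook_name,   # the AWS Health aware runbook
            # Runbook parameters are passed as lists of strings.
            "Parameters": {k: list(v) for k, v in runbook_parameters.items()},
        }],
    }


def submit_change_request(ssm_client: Any, request: Dict[str, Any]) -> str:
    """Submit the request and return the resulting automation execution ID."""
    response = ssm_client.start_change_request_execution(**request)
    return response["AutomationExecutionId"]
```

With a real client, `submit_change_request(boto3.client("ssm"), request)` would create the change request, which then waits for the approval described in step 3.2 before it runs.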

Conclusion

We’ve learned how to embed AWS Health into the operational change process managed by AWS Systems Manager Change Manager. This approach helps you achieve operational excellence and minimize business disruptions during operational changes.

“Anticipate Failure” is part of the Operational Excellence Pillar of the AWS Well-Architected Framework. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructures for a variety of applications and workloads. Use the AWS Well-Architected Tool to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow-up questions or comments, join our growing community on AWS re:Post.

About the authors

Jerry Chen

Jerry Chen is currently a Senior AWS Well-Architected GEO Solutions Architect at Amazon Web Services (AWS). He’s been focusing on cloud security and operational architecture design for AWS customers and partners.

You can follow Jerry on LinkedIn.

Phong Le

Phong Le is a Solutions Architect on the AWS Well-Architected team. As part of the AWS Well-Architected team, Phong’s focus is to work with AWS customers and AWS Partner Network (APN) partners to drive the AWS best practices adoption within their organizations.