Implementing multi-Region failover for Amazon API Gateway

This post is written by Marcos Ortiz, Principal AWS Solutions Architect, and Khubyar Behramsha, Sr. AWS Solutions Architect.

In this post, you learn how organizations can evolve from a single-Region API Gateway architecture to a multi-Region one, using a reliable failover mechanism without dependencies on AWS control plane operations. An AWS Well-Architected best practice is to rely on the data plane and not the control plane during recovery. Failover controls should work with no dependencies on the primary Region. This pattern shows how to independently fail over discrete services deployed behind a shared public API. Additionally, there is a walkthrough on how to deploy and test the proposed architecture, using our open-source code available on GitHub.

For many organizations, running services behind a Regional Amazon API Gateway endpoint, aligned to AWS Well-Architected best practices, offers the right balance of resilience, simplicity, and affordability. However, depending on business criticality, regulatory requirements, or disaster recovery objectives, some organizations must deploy their APIs using a multi-Region architecture.

When dealing with business-critical applications, organizations often want full control over how and when to trigger a failover. A manually triggered failover allows for dependencies to be failed over in a specific order. Failover actions follow the chain of approvals needed, which helps prevent failing over to an unprepared replica or other flapping issues caused by intermittent disruptions. While the failover action or trigger has a human-in-the-loop component, the recommendation is for all subsequent actions to be automated as much as possible. This approach gives application owners control over the failover process, including the ability to trigger the failover in cases of intermittent issues.

Overview

One common approach for customers is to deploy a public Regional API with a custom domain name, providing more intuitive URLs for their users. The backend uses API mappings to connect multiple API stages to a custom domain. This approach allows service owners to deploy their services independently while sharing the same top-level API domain name. Here is a typical architecture that follows this pattern:

Regional endpoint with mapping

However, when trying to evolve this to a multi-Region architecture, organizations often struggle to fail over each service independently. If the preceding architecture is deployed in two Regions as-is, it becomes an all-or-nothing scenario, where organizations must either fail over all the services behind API Gateway or none.

Evolving to a multi-Region architecture

To enable each team to manage and fail over their services independently, you can implement this new approach for a multi-Region architecture. Each service has its own subdomain, and the shared public API uses API Gateway HTTP integrations to route each request to the matching service. This gives teams the flexibility to fail over the service APIs independently, or all at once together with the shared public API.

Multi-Region architecture

This is the request flow:

1. Users access a specific service through the public shared API domain name using a URL suffix. For instance, to access service1, the end user sends a request to https://example.com/service1.
2. Amazon Route 53 has the top-level domain, example.com, registered with a primary and a secondary failover record. It routes the request to the API Gateway external API endpoint in the primary Region (us-east-1).
3. API Gateway uses an HTTP integration to forward the request to service1 at https://service1.example.com.
4. Amazon Route 53 has the domain service1.example.com registered with a primary and a secondary failover record. It routes the request to the service1 API Regional endpoint in the primary Region (us-east-1) when the primary is healthy, and to the service1 API Regional endpoint in the secondary Region (us-west-2) when the primary is unhealthy.
5. Represents the primary route for service1 configured in Amazon Route 53.
6. Represents the secondary route for service1 configured in Amazon Route 53.
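
The request flow above can be modeled in a few lines of Python. This is a simplified sketch for illustration only: it assumes that a failover record pair answers with the primary Region while the primary health check is healthy and with the secondary Region otherwise, and it ignores the health of the secondary record. The function names and the `health` map are hypothetical and are not part of the sample repository.

```python
from typing import Dict

PRIMARY = "us-east-1"
SECONDARY = "us-west-2"
SHARED_DOMAIN = "example.com"


def resolve_failover(primary_healthy: bool) -> str:
    """Simplified failover record pair: primary while healthy, else secondary."""
    return PRIMARY if primary_healthy else SECONDARY


def trace_request(path: str, health: Dict[str, bool]) -> Dict[str, str]:
    """Return the Region serving each hop for a request such as /service1.

    `health` maps a domain name to the state of its primary health check.
    Domains missing from the map are treated as healthy.
    """
    service = path.strip("/").split("/")[0]
    service_domain = f"{service}.{SHARED_DOMAIN}"
    return {
        # Hop 1: Route 53 resolves the shared public API domain.
        "shared_api_region": resolve_failover(health.get(SHARED_DOMAIN, True)),
        # Hop 2: the HTTP integration forwards to the service subdomain.
        "backend_url": f"https://{service_domain}",
        # Hop 3: Route 53 resolves the service subdomain independently.
        "service_region": resolve_failover(health.get(service_domain, True)),
    }


if __name__ == "__main__":
    # service1 is failed over; the shared API and service2 are untouched.
    print(trace_request("/service1", {"service1.example.com": False}))
    print(trace_request("/service2", {"service1.example.com": False}))
```

Because each subdomain has its own failover pair, marking service1 unhealthy changes only the last hop for service1 requests.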

This solution requires deploying each service API in both the primary (us-east-1) and secondary (us-west-2) Regions. Both Regions use the same custom domain configuration. The primary DNS record for each service points to the Regional API Gateway distribution endpoint in the primary Region, and the secondary DNS record for each service points to the Regional API Gateway distribution endpoint in the secondary Region.

Route 53 records
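
To make the record layout concrete, the following sketch builds the primary and secondary failover alias records as plain dictionaries, in the shape that the Route 53 `ChangeResourceRecordSets` API accepts. The sample repository may define these records differently (for example, in a template), and the domain names, hosted zone IDs, and health check IDs used here are placeholders, not values from the sample.

```python
from typing import Any, Dict, Optional


def failover_record(
    name: str,
    role: str,
    regional_domain_name: str,
    regional_hosted_zone_id: str,
    health_check_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover alias record.

    role is "PRIMARY" or "SECONDARY". The alias target is the Regional
    domain name that API Gateway assigns to the custom domain in that Region.
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"role must be PRIMARY or SECONDARY, got {role!r}")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": regional_hosted_zone_id,
            "DNSName": regional_domain_name,
            "EvaluateTargetHealth": False,
        },
    }
    if health_check_id is not None:
        # The health check is what the failover decision is based on.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


if __name__ == "__main__":
    # All identifiers below are placeholders.
    changes = [
        failover_record("service1.example.com", "PRIMARY",
                        "d-primary.execute-api.us-east-1.amazonaws.com",
                        "ZPRIMARYEXAMPLE", "primary-health-check-id"),
        failover_record("service1.example.com", "SECONDARY",
                        "d-secondary.execute-api.us-west-2.amazonaws.com",
                        "ZSECONDARYEXAMPLE", "secondary-health-check-id"),
    ]
    print({"Changes": changes})
```

Creating or changing these records is a control plane operation, so it belongs to deployment time rather than to the failover path.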

Active-passive manual failover

The example provided here enables a reliable failover mechanism that does not rely on the Amazon Route 53 control plane. It uses Amazon Route 53 Application Recovery Controller (Route 53 ARC), which provides a cluster with five Regional endpoints across five different AWS Regions. The failover process uses these endpoints instead of manually editing Amazon Route 53 DNS records, which is a control plane operation. The routing controls in Route 53 ARC fail over traffic from the primary Region to the secondary one.

Route 53 ARC routing controls
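
Because the cluster exposes several Regional endpoints, failover tooling can try them in turn so that losing one Region does not block the failover. The helper below is a minimal sketch of that idea. The `operation` callable stands in for whatever routing control call you make, and the endpoint list and the simulated outage are placeholders for illustration.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def call_first_available(
    endpoints: Iterable[Tuple[str, str]],
    operation: Callable[[str, str], T],
) -> T:
    """Run operation(region, endpoint_url) against each endpoint until one succeeds.

    Raises RuntimeError listing every failure if no endpoint responds.
    """
    errors: List[str] = []
    for region, url in endpoints:
        try:
            return operation(region, url)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all cluster endpoints failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Placeholder endpoints; real values come from your own cluster.
    endpoints = [
        ("us-east-1", "https://abcd1234.route53-recovery-cluster.us-east-1.amazonaws.com/v1"),
        ("ap-southeast-2", "https://abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1"),
    ]

    def fake_operation(region: str, url: str) -> str:
        # Simulate the first Region being unreachable.
        if region == "us-east-1":
            raise ConnectionError("simulated Regional outage")
        return f"updated via {region}"

    print(call_first_available(endpoints, fake_operation))
```

Shuffling the endpoint list before each call is a common refinement, so that callers do not all favor the same Region.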

Routing controls are on-off switches that enable you to redirect client traffic from one instance of your workload to another. Traffic re-routing is the result of setting associated DNS health checks as healthy or unhealthy.

Route 53 ARC toggles
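
The link between a routing control and DNS is a Route 53 health check of type `RECOVERY_CONTROL` that references the routing control ARN. The sketch below builds the request parameters for such a health check and models how the switch state maps to health status. The helper names are hypothetical, the ARN is the placeholder used elsewhere in this post, and the sample stacks may create these health checks through a template instead.

```python
import uuid
from typing import Any, Dict


def routing_control_health_check(routing_control_arn: str) -> Dict[str, Any]:
    """Parameters for a Route 53 health check driven by a routing control."""
    return {
        "CallerReference": str(uuid.uuid4()),  # must be unique per request
        "HealthCheckConfig": {
            "Type": "RECOVERY_CONTROL",
            "RoutingControlArn": routing_control_arn,
        },
    }


def health_from_state(routing_control_state: str) -> str:
    """Model the on-off switch: On reports healthy, Off reports unhealthy."""
    states = {"On": "healthy", "Off": "unhealthy"}
    if routing_control_state not in states:
        raise ValueError(f"unknown routing control state: {routing_control_state!r}")
    return states[routing_control_state]


if __name__ == "__main__":
    arn = (
        "arn:aws:route53-recovery-control::111122223333:"
        "controlpanel/0123456bbbbbbb0123456bbbbbb0123456/"
        "routingcontrol/abcdefg1234567"
    )
    print(routing_control_health_check(arn)["HealthCheckConfig"])
    print(health_from_state("Off"))
```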

Deploying the sample application

Prerequisites

A public domain (example.com) registered with Amazon Route 53. Follow the instructions here on how to register a domain and the instructions here to configure Amazon Route 53 as your DNS service.
An AWS Certificate Manager certificate (*.example.com) for your domain name in both the primary and secondary Regions where you plan to deploy the sample APIs.

Deploy the Amazon Route 53 ARC stack

Deploy the Amazon Route 53 ARC stack first, which creates a cluster and the routing controls that enable you to fail over the APIs.

Follow the detailed instructions here to deploy the Amazon Route 53 Application Recovery Controller (ARC) stack.

Deploy the Service1 API both in the primary and secondary Regions

This deploys an API Gateway Regional endpoint in each Region, which calls an AWS Lambda function to return the service name and the current AWS Region serving the request:

{"service": "service1", "region": "us-east-1"}

This is the code for the Lambda function:

import json
import os


def lambda_handler(event, context):
    # AWS_REGION is set by the Lambda runtime to the Region running the function.
    return {
        "statusCode": 200,
        "body": json.dumps({
            "service": "service1",
            "region": os.environ["AWS_REGION"],
        }),
    }

Follow the detailed instructions here to deploy the service1 stack.

Deploy the Service2 API both in the primary and secondary Regions

This stack is similar to service1, but has a different domain name and returns service2 as the service name:

{"service": "service2", "region": "us-east-1"}

Follow the detailed instructions here to deploy the service2 stack.
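
Since the two services differ only in the name they return, one way to avoid duplicating the handler is to read the service name from configuration. The variant below is a sketch of that idea, not the code in the sample repository. `SERVICE_NAME` is a hypothetical environment variable that you would set on each function, while `AWS_REGION` is set by the Lambda runtime.

```python
import json
import os


def lambda_handler(event, context):
    # SERVICE_NAME is a hypothetical variable set per function at deploy time.
    service = os.environ.get("SERVICE_NAME", "service1")
    # AWS_REGION is provided by the Lambda runtime; the fallback helps local runs.
    region = os.environ.get("AWS_REGION", "unknown")
    return {
        "statusCode": 200,
        "body": json.dumps({"service": service, "region": region}),
    }


if __name__ == "__main__":
    # Local invocation with the values a service2 deployment in us-east-1 would see.
    os.environ["SERVICE_NAME"] = "service2"
    os.environ["AWS_REGION"] = "us-east-1"
    print(lambda_handler({}, None)["body"])
```

Running the file locally prints the same JSON body shown above for service2.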

Deploy the shared public API both in the primary and secondary Regions

This step configures HTTP endpoints so that when you call example.com/service1 or example.com/service2, it routes the request to the respective public DNS records you have set up for service1 and service2.
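
As an illustration of what this routing looks like in configuration, the sketch below generates OpenAPI path entries that use the `x-amazon-apigateway-integration` extension with an `http_proxy` integration. It assumes an API defined through OpenAPI import and a GET method on each path. The external API stack in the sample may be defined differently, the fragment omits the other fields a complete definition needs, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable


def proxy_path(service: str, domain: str) -> Dict[str, Any]:
    """Path item that forwards GET /<service> to https://<service>.<domain>."""
    return {
        "get": {
            "x-amazon-apigateway-integration": {
                "type": "http_proxy",
                "httpMethod": "GET",
                # The service subdomain has its own failover records, so the
                # shared API does not need to know which Region is active.
                "uri": f"https://{service}.{domain}",
            }
        }
    }


def shared_api_paths(services: Iterable[str], domain: str = "example.com") -> Dict[str, Any]:
    """Build the `paths` section for every service behind the shared API."""
    return {f"/{name}": proxy_path(name, domain) for name in services}


if __name__ == "__main__":
    import json
    print(json.dumps(shared_api_paths(["service1", "service2"]), indent=2))
```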

Follow the detailed instructions here to deploy the external API stack.

Failover tests

To test the deployed example, modify then run the provided test script:

Update lines 3–5 in the test.sh file to reference the domain name you configured for your APIs.
Provide execute permissions and run the script:

chmod +x ./test.sh
./test.sh

This script sends an HTTP request to each one of your three endpoints every 5 seconds. You can then use Amazon Route 53 ARC to fail over your services independently and see the responses served from different Regions.
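
If you prefer Python to the shell script, the polling loop can be sketched as follows. This is not a port of test.sh: the URL list in the usage note is a guess at what you might poll, and the only assumption about the response is that it carries the `service` and `region` fields shown earlier.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, List


def fetch(url: str, timeout: float = 5.0) -> str:
    """Return the response body for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def describe(url: str, body: str) -> str:
    """Summarize which Region answered, based on the JSON body."""
    try:
        data = json.loads(body)
        return f"{url} -> {data['service']} served from {data['region']}"
    except (ValueError, KeyError, TypeError):
        return f"{url} -> unexpected response: {body[:80]}"


def poll_once(urls: Iterable[str], fetcher: Callable[[str], str] = fetch) -> List[str]:
    """Query every URL once and return one summary line per URL."""
    lines = []
    for url in urls:
        try:
            lines.append(describe(url, fetcher(url)))
        except OSError as exc:  # covers URLError, HTTPError, and timeouts
            lines.append(f"{url} -> request failed: {exc}")
    return lines


def poll_forever(urls: List[str], interval: float = 5.0) -> None:
    """Print a summary for every URL each `interval` seconds until interrupted."""
    while True:
        for line in poll_once(urls):
            print(line)
        time.sleep(interval)
```

To start polling, call `poll_forever` with your own endpoints, for example `poll_forever(["https://example.com/service1", "https://example.com/service2"])`, and stop it with Ctrl+C.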

Initially, all services are routing traffic to the us-east-1 Region:

Initial routing

With the following command, you update two routing controls for service1, setting the primary Region (us-east-1) health check state to off, and the secondary Region (us-west-2) health check state to on:

aws route53-recovery-cluster update-routing-control-states \
 --update-routing-control-state-entries \
 '[{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567","RoutingControlState":"On"},
{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321","RoutingControlState":"Off"}]' \
 --region ap-southeast-2 \
 --endpoint-url https://abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1

After a few seconds, the script terminal shows that service1 is now routing traffic to us-west-2, while the other services are still routing traffic to the us-east-1 Region.

Flipping service1 to backup Region

To fail back service1 to the us-east-1 Region, run this command, now setting the service1 primary Region (us-east-1) health check state to on, and the secondary Region (us-west-2) health check state to off:

aws route53-recovery-cluster update-routing-control-states \
 --update-routing-control-state-entries \
 '[{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567","RoutingControlState":"Off"},
{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321","RoutingControlState":"On"}]' \
 --region ap-southeast-2 \
 --endpoint-url https://abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1

After a few seconds, the script terminal shows that service1 is now routing traffic to the us-east-1 Region again, like the other services.

Routing recovery
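
The failover and failback commands above differ only in which routing control is switched on. A small helper can build the state entries for either direction, which reduces the chance of swapping the two ARNs by hand. This is a sketch that assumes boto3 is installed when you apply the change. The function names are hypothetical, and the ARNs, Region, and endpoint URL are values you supply from your own cluster.

```python
from typing import Dict, List


def failover_entries(
    primary_arn: str, secondary_arn: str, to_secondary: bool
) -> List[Dict[str, str]]:
    """Build routing control state entries for failover or failback.

    to_secondary=True turns the primary control Off and the secondary On;
    to_secondary=False restores the primary.
    """
    return [
        {
            "RoutingControlArn": primary_arn,
            "RoutingControlState": "Off" if to_secondary else "On",
        },
        {
            "RoutingControlArn": secondary_arn,
            "RoutingControlState": "On" if to_secondary else "Off",
        },
    ]


def apply_entries(entries: List[Dict[str, str]], region: str, endpoint_url: str) -> None:
    """Send both state changes in one call to a single cluster endpoint."""
    import boto3  # imported here so failover_entries has no third-party dependency

    client = boto3.client(
        "route53-recovery-cluster",
        region_name=region,
        endpoint_url=endpoint_url,
    )
    client.update_routing_control_states(UpdateRoutingControlStateEntries=entries)
```

For example, `apply_entries(failover_entries(primary_arn, secondary_arn, to_secondary=True), "ap-southeast-2", endpoint_url)` mirrors the failover command, and passing `to_secondary=False` mirrors the failback.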

Cleaning up

After you are finished, follow the cleanup instructions on GitHub.

Conclusion

This solution helps put control back in the hands of the teams managing critical workloads using API Gateway. By decoupling the frontend and backend, it gives organizations granular control over failover at the service level, and it uses Amazon Route 53 ARC to remove dependencies on control plane actions.

The pattern outlined also reduces the impact on consumers of the service, because it allows you to keep the same public API and top-level domain when moving from a single-Region to a multi-Region architecture.

For more resilience learning, visit AWS Architecture Blog – Resilience.

For more serverless learning, visit Serverless Land.

AWS Compute Blog