AWS Big Data Blog

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Amazon OpenSearch Service is a fully managed service offered by AWS that enables you to deploy, operate, and scale OpenSearch domains effortlessly. OpenSearch is a distributed search and analytics engine, which is an open-source project. OpenSearch Service seamlessly integrates with other AWS offerings, providing a robust solution for building scalable and resilient search and analytics applications in the cloud.

Disaster recovery is vital for organizations, offering a proactive strategy to mitigate the impact of unforeseen events like system failures, natural disasters, or cyberattacks.

In Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud, we introduced four major strategies for disaster recovery (DR) on AWS. These strategies enable you to prepare for and recover from a disaster. By using the best practices provided in the AWS Well-Architected Reliability Pillar to design your DR strategy, your workloads can remain available despite disaster events such as natural disasters, technical failures, or human actions. OpenSearch Service provides various DR solutions, including active-passive and active-active approaches. This post focuses on introducing an active-passive approach using a snapshot and restore strategy.

Snapshot and restore in OpenSearch Service

The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots, of your OpenSearch domain. These snapshots capture the entire state of the domain, including indexes, mappings, and settings. In the event of data loss or system failure, you can use these snapshots to restore the domain to a specific point in time. Implementing a snapshot and restore strategy helps organizations meet their Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs) by limiting data loss and shortening recovery in case of disasters.

Compared with other DR strategies, snapshot and restore results in longer downtime and greater data loss between the disaster event and recovery. However, it can still be the right strategy for your workload because it is the most straightforward and least expensive strategy to implement. Additionally, not all workloads require an RTO and RPO of minutes or less.

Solution overview

The following architecture diagram illustrates how manual snapshots are taken from the OpenSearch Service domain in the primary AWS Region and stored in an Amazon Simple Storage Service (Amazon S3) bucket in the secondary Region.

We walk through each step and discuss scenarios for failing over to the OpenSearch Service domain in the secondary Region in the event of a disaster in the primary Region, as well as how to fail back to the OpenSearch Service domain to resume operations in the primary Region.

bdb-4227-Arch1.1

The workflow consists of the following initial steps:

  1. OpenSearch Service is hosted in the primary Region, and all the active traffic is routed to the OpenSearch Service domain in the primary Region.
  2. The manual snapshots from the OpenSearch Service domain in the primary Region are transferred to the S3 bucket in the secondary Region on a predefined schedule.

This process can be programmatically scheduled using an AWS Lambda function, as described in Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service. Storing the snapshots in a different Region gives you the most effective protection from disasters of any scope of impact. In the event of a disaster in the primary Region, in addition to recovering OpenSearch data from backup, you must also be able to restore your infrastructure in the secondary Region. Infrastructure as code (IaC) methods such as using AWS CloudFormation or the AWS Cloud Development Kit (AWS CDK) enable you to deploy consistent infrastructure across Regions.

The following diagram illustrates the architecture in the event of a disaster.

bdb-4227-Arch1.2

The workflow consists of the following steps:

  1. In the event of a disaster making the OpenSearch Service domain in the primary Region unavailable, all active traffic routed to the primary Region’s OpenSearch Service domain will cease.
  2. When the OpenSearch Service domain becomes unavailable, the manual snapshots to Amazon S3 will no longer be taken at the predefined intervals.
  3. To fail over, launch the OpenSearch Service domain in the secondary Region using IaC. Restore manual snapshots from the S3 bucket in the secondary Region to the OpenSearch Service domain in the secondary Region. For log workloads, restore only recent or relevant logs to save time and use this opportunity to purge unnecessary documents or indexes.
  4. Update the DNS controller (Amazon Route 53) to redirect traffic to the OpenSearch Service domain in the secondary Region.
  5. When the primary Region becomes available, set up manual snapshots from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region.

The following diagram illustrates the architecture after the primary Region becomes available.

bdb-4227-Arch1.3

The workflow consists of the following steps:

  1. When the primary Region becomes available again, destroy the existing OpenSearch domain in the primary Region. Launch a new OpenSearch Service domain in the primary Region.
  2. Restore manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain created in the previous step.
  3. Update Route 53 to redirect traffic to the new OpenSearch Service domain in the primary Region.
  4. Set up manual snapshots from the new OpenSearch Service domain in the primary Region to a new prefix in the S3 bucket in the secondary Region.
  5. After successfully failing back to the OpenSearch Service domain in the primary Region, destroy the OpenSearch Service domain in the secondary Region.

In this post, we demonstrate how to launch an OpenSearch Service domain in the primary Region and set up manual snapshots to an S3 bucket in the secondary Region. Then we simulate a failover to resume operations using the OpenSearch Service domain in the secondary Region in the event of a disaster. Finally, we illustrate the failback mechanism by reverting to the OpenSearch Service domain in the primary Region.

Regular operations

In this section, we discuss the regular operations to set up the solution architecture.

Launch an OpenSearch Service domain in the primary Region

Create an OpenSearch Service domain in the primary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode. Create indexes and populate them with documents.

Create an S3 bucket in the secondary Region

To store OpenSearch snapshots in the secondary Region, you need to create an S3 bucket in that Region. For instructions, see Creating a bucket.

Create the snapshot IAM role

The snapshot AWS Identity and Access Management (IAM) role is necessary to grant permissions specifically for managing snapshots within the OpenSearch Service domain. For instructions, see Creating an IAM role (console). We refer to this role as TheSnapshotRole in this post.

  1. Attach the following IAM policy to TheSnapshotRole:
    {
      "Version": "2012-10-17",
      "Statement": [{
          "Action": [
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name"
          ]
        },
        {
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name/*"
          ]
        }
      ]
    }
  2. Edit the trust relationship of TheSnapshotRole to specify OpenSearch Service in the Principal statement, as shown in the following example:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
      "Service": "es.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}

To register the snapshot repository, you need to be able to pass TheSnapshotRole to OpenSearch Service. You also need access to the es:ESHttpPut action.

  1. To grant both of these permissions, attach the following policy to the IAM role whose credentials are being used to sign the request:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/TheSnapshotRole"
    },
    {
      "Effect": "Allow",
      "Action": "es:ESHttpPut",
      "Resource": "arn:aws:es:region:123456789012:domain/domain-name/*"
    }
  ]
}
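The three policy documents above differ only in the bucket name, account ID, Region, and domain name, so they can be generated from parameters when the same setup has to be repeated in another Region. The following is a minimal Python sketch of that idea. The function names and the example values are illustrative assumptions and are not part of any AWS SDK.

```python
import json


def snapshot_role_policy(bucket: str) -> dict:
    """Permissions policy for TheSnapshotRole: S3 access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:ListBucket"],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def snapshot_role_trust_policy() -> dict:
    """Trust relationship that lets OpenSearch Service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": "es.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def caller_policy(account_id: str, region: str, domain: str,
                  role_name: str = "TheSnapshotRole") -> dict:
    """Policy for the identity that signs the repository registration request."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
            },
            {
                "Effect": "Allow",
                "Action": "es:ESHttpPut",
                "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Example values only; replace them with your own bucket, account, and domain.
    print(json.dumps(snapshot_role_policy("s3-bucket-name"), indent=2))
    print(json.dumps(caller_policy("123456789012", "us-east-1", "domain-name"), indent=2))
```

The printed JSON can be pasted into the IAM console or supplied to whichever IaC tool you use to create the roles.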

Associate the IAM role or user to the OpenSearch security role for manual snapshots

Fine-grained access control introduces an additional step when registering a repository. Even if you use HTTP basic authentication for all other purposes, you need to map the manage_snapshots role to your IAM role that has iam:PassRole permissions to pass TheSnapshotRole. Snapshots can only be taken by a process or user associated with an IAM identity. This makes sure only authorized entities can create, manage, or restore snapshots.

One such method is to use Amazon Cognito. With Amazon Cognito, users can sign in with IAM credentials indirectly, either using proxy mapping with SAML or through user pool credentials. This setup provides a secure way to manage access while using the capabilities of IAM. The preferred method is to use a process that signs requests with AWS SigV4. This approach involves programmatically signing each request to OpenSearch with the appropriate IAM credentials, making sure only authorized processes can manage snapshots. This method is recommended because it provides a higher level of security and can be automated using Lambda functions as part of your backup and DR workflows.

  1. On OpenSearch Dashboards, navigate to the main menu and choose Security.
  2. Choose Roles and search for the manage_snapshots role.
  3. Choose Mapped users and choose Manage mappings.
  4. Add the Amazon Resource Name (ARN) of the IAM role that has iam:PassRole permissions to pass TheSnapshotRole to the backend roles.

bdb-4227-AssociateRole
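If you prefer to script this mapping instead of using OpenSearch Dashboards, the Security plugin also exposes role mappings through its REST API. The following Python sketch only builds the request body. It assumes the role mappings endpoint path shown in the constant and a GET response keyed by role name. Because a PUT to that endpoint replaces the whole mapping, the helper (a hypothetical name) merges the new ARN with any backend roles that are already mapped.

```python
import json

# Assumed Security plugin endpoint for the manage_snapshots role mapping.
ROLE_MAPPING_PATH = "/_plugins/_security/api/rolesmapping/manage_snapshots"


def merged_mapping_body(existing: dict, role_arn: str) -> dict:
    """Build a PUT body that keeps current mappings and adds role_arn.

    `existing` is the parsed JSON from a GET on ROLE_MAPPING_PATH, or an
    empty dict if the role has no mapping yet.
    """
    current = existing.get("manage_snapshots", {})
    backend_roles = list(current.get("backend_roles", []))
    if role_arn not in backend_roles:
        backend_roles.append(role_arn)
    return {
        "backend_roles": backend_roles,
        "hosts": list(current.get("hosts", [])),
        "users": list(current.get("users", [])),
    }


if __name__ == "__main__":
    # Example ARN only; use the IAM role ARN you entered in the steps above.
    body = merged_mapping_body({}, "arn:aws:iam::123456789012:role/ExampleRole")
    print("PUT", ROLE_MAPPING_PATH)
    print(json.dumps(body, indent=2))
```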

Register a snapshot repository on the OpenSearch Service domain

To register a snapshot repository, send a signed PUT request to the OpenSearch Service domain endpoint using curl, Postman, an integrated development environment (IDE) such as PyCharm or VS Code, or another method. Using a PUT request in OpenSearch Dashboards for repository registration is not supported. For more details, see Using OpenSearch Dashboards with Amazon OpenSearch Service.

The curl command is as follows:

curl --aws-sigv4 "aws:amz:us-east-1:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

Use the curl command to register a snapshot repository in the OpenSearch Service domain in the primary Region pointing to the S3 bucket in the secondary Region.
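For automation (for example, from a Lambda function), the same registration request can be built and signed in code. The following Python sketch uses only the standard library to illustrate what SigV4 signing involves. It is deliberately simplified: it assumes long-term credentials without a session token, a URL with no query string, and a path made only of unreserved characters. The function names and the example endpoint are hypothetical, and in production a maintained signing library is the safer choice.

```python
import datetime
import hashlib
import hmac
import json
import urllib.parse
import urllib.request


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(method, url, body, region, access_key, secret_key, now=None):
    """Return request headers, including a SigV4 Authorization header."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parsed = urllib.parse.urlparse(url)
    payload_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()

    # Headers included in the signature: lowercase names, sorted.
    signed = {
        "content-type": "application/json",
        "host": parsed.netloc,
        "x-amz-date": amz_date,
    }
    names = sorted(signed)
    canonical_request = "\n".join([
        method,
        parsed.path or "/",
        "",  # canonical query string (assumed empty)
        "".join(f"{n}:{signed[n]}\n" for n in names),
        ";".join(names),
        payload_hash,
    ])
    scope = f"{date_stamp}/{region}/es/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date, then Region, then service, then terminator.
    key = _sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, "es", "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return {
        "Content-Type": "application/json",
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={';'.join(names)}, Signature={signature}"
        ),
    }


def register_repository_request(endpoint, repo, bucket, role_arn, region,
                                access_key, secret_key, now=None):
    """Build (but do not send) the signed PUT that registers the repository."""
    url = f"https://{endpoint}/_snapshot/{repo}"
    body = json.dumps({
        "type": "s3",
        "settings": {
            "bucket": bucket,
            "endpoint": "s3.amazonaws.com",
            "role_arn": role_arn,
        },
    })
    headers = sigv4_headers("PUT", url, body, region, access_key, secret_key, now)
    return urllib.request.Request(url, data=body.encode("utf-8"),
                                  headers=headers, method="PUT")
```

Passing the returned object to `urllib.request.urlopen` sends the request. The `region` argument is the Region of the OpenSearch Service domain that receives the request, which matches the Region in the curl `--aws-sigv4` option.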

To verify the snapshot repository creation, run the following query:

GET /_snapshot/os-snapshot-repo

bdb-4227-GetSnapshot

Take manual snapshots

To take a manual snapshot, perform the following steps from OpenSearch Dashboards. To include or exclude certain indexes and specify other settings, add a request body. For the request structure, see Take snapshots in the OpenSearch documentation.

  1. To create a manual snapshot, use the following query. In this query, the repository name is os-snapshot-repo and the snapshot name is 2023-11-18.

PUT /_snapshot/os-snapshot-repo/2023-11-18

bdb-4227-PutSnapshot

  2. Verify that the snapshot has been created and check which indexes it includes:

GET /_snapshot/os-snapshot-repo/_all

bdb-4227-GetAllSnapshots

  3. Schedule your manual snapshot at a defined interval (for example, every 1 hour) based on your RPO requirements.

You can schedule this by creating an Amazon EventBridge rule to invoke a Lambda function every hour. For instructions, see Tutorial: Create an EventBridge scheduled rule for AWS Lambda functions. The Lambda function will transfer incremental manual snapshots into Amazon S3. For more information, see Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service.
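As a rough illustration of the scheduling pieces, the following Python sketch derives an EventBridge rate expression from the snapshot interval and builds a timestamped snapshot request that a scheduled Lambda function could send. The function names and the naming scheme are assumptions made for this example. Timestamped names are used because each snapshot in a repository needs a unique name.

```python
import datetime


def rate_expression(hours: int) -> str:
    """EventBridge rate expression for an interval given in whole hours."""
    if hours < 1:
        raise ValueError("interval must be at least 1 hour")
    # EventBridge expects a singular unit for 1 and a plural unit otherwise.
    unit = "hour" if hours == 1 else "hours"
    return f"rate({hours} {unit})"


def snapshot_request(repo: str, now: datetime.datetime) -> tuple:
    """Return the HTTP method and path for a snapshot named after `now`."""
    name = now.strftime("%Y-%m-%d-%H-%M")
    return "PUT", f"/_snapshot/{repo}/{name}"


if __name__ == "__main__":
    print(rate_expression(1))
    print(snapshot_request("os-snapshot-repo",
                           datetime.datetime.now(datetime.timezone.utc)))
```

The request itself still has to be signed with SigV4 before it is sent to the domain endpoint.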

Failover scenario

In a disaster, if your OpenSearch Service domain in the primary Region goes down, you can fail over to a domain in the secondary Region. This provides business continuity and minimizes downtime during unexpected Region failures.

To maintain business continuity during a disaster, you can use message queues like Amazon Simple Queue Service (Amazon SQS) and streaming solutions like Apache Kafka or Amazon Kinesis. These tools buffer incoming data in the primary Region, allowing you to replay traffic for a predefined period in the secondary Region when you fail over, so that the OpenSearch Service domain stays up to date with all recent changes.

Launch an OpenSearch Service domain in the secondary Region

Create an OpenSearch Service domain in the secondary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode.

Depending on your RTO requirements, you can keep the OpenSearch Service domain in the secondary Region up and running if you have an RTO of less than 1 hour. However, it will incur additional costs. If you have an RTO of more than 1 hour, you can launch a new OpenSearch Service domain in the secondary Region during the failover activity to reduce operational costs.

Associate the IAM role or user to the OpenSearch security role for manual snapshots

Follow the instructions in the previous section to associate the IAM role with the OpenSearch security role.

Register a snapshot repository on the OpenSearch Service domain

To make sure your data is available for failover, you need to register a snapshot repository on the OpenSearch Service domain in the secondary Region. The snapshots taken from your OpenSearch Service domain in the primary Region can be restored. Use the following command:

curl --aws-sigv4 "aws:amz:us-west-2:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

The S3 bucket should be the bucket created in the secondary Region where the snapshots from your OpenSearch Service domain in the primary Region are stored.

Restore snapshots

Before you restore a snapshot, make sure that the destination domain doesn’t use Multi-AZ with standby.

After you register the snapshot repository on your OpenSearch Service domain in the secondary Region, the next step is to restore the desired indexes from the snapshot repository so that your data is available in that domain. You can selectively restore specific indexes from your snapshot, which gives you the flexibility to recover only the necessary data. Use the following command:

POST /_snapshot/<REPOSITORY_NAME>/<SNAPSHOT_NAME>/_restore
{
"indices": "movie-index"
}

bdb-4227-Restore
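For log workloads, where only recent indexes need to be restored, the restore request body can be generated rather than written by hand. The following Python sketch assumes daily indexes named with a prefix followed by a `YYYY.MM.DD` date (for example, `logs-2023.11.18`). That naming scheme and the function names are assumptions for illustration. The list of index names would come from the `indices` field returned when you list the snapshot.

```python
import datetime
import json


def recent_indexes(index_names, prefix, today, days):
    """Keep daily indexes named <prefix>YYYY.MM.DD from the last `days` days."""
    cutoff = today - datetime.timedelta(days=days)
    keep = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.datetime.strptime(
                name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, so skip this index
        if cutoff < day <= today:
            keep.append(name)
    return sorted(keep)


def restore_body(indexes):
    """Request body for POST /_snapshot/<repo>/<snapshot>/_restore."""
    return {
        "indices": ",".join(indexes),
        # Restore index data only, not cluster-wide state.
        "include_global_state": False,
    }


if __name__ == "__main__":
    names = ["logs-2023.11.10", "logs-2023.11.17", "logs-2023.11.18", "movie-index"]
    selected = recent_indexes(names, "logs-", datetime.date(2023, 11, 18), days=2)
    print(json.dumps(restore_body(selected), indent=2))
```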

Verify that all the necessary indexes have been restored to the OpenSearch Service domain in the secondary Region.

Update Route 53 to redirect traffic to the OpenSearch Service domain in the secondary Region

After you restore the snapshots to the OpenSearch Service domain in the secondary Region, update the DNS settings (Route 53) with the new OpenSearch Service domain endpoint to redirect indexing traffic to the OpenSearch Service domain in the secondary Region. Route 53, a scalable DNS service, can seamlessly redirect traffic to the new OpenSearch endpoint by updating its DNS records.

A Route 53 resource record set directs internet traffic to specific resources, such as an OpenSearch Service domain. It includes a domain name, a record type (for example, CNAME), and the DNS name or IP address of the endpoint. To redirect traffic to a new endpoint, update or create a new record set.
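The record update can be expressed as a Route 53 change batch. The following Python sketch builds an UPSERT for a CNAME record, which creates the record if it is missing and overwrites it otherwise. The record name, target endpoint, TTL, and function name are placeholder assumptions. The resulting document has the shape expected as the change batch of the Route 53 ChangeResourceRecordSets API.

```python
import json


def cname_upsert(record_name: str, target_endpoint: str, ttl: int = 60) -> dict:
    """Change batch that points record_name at target_endpoint."""
    return {
        "Comment": "Redirect OpenSearch traffic to the active Region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    # A short TTL lets clients pick up the change sooner.
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical names; replace with your record and domain endpoint.
    batch = cname_upsert("search.example.com",
                         "search-example.us-west-2.es.amazonaws.com")
    print(json.dumps(batch, indent=2))
```

The same helper covers failback: call it again with the endpoint of the new domain in the primary Region.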

Set up manual snapshots from the OpenSearch Service domain in the secondary Region to the Amazon S3 bucket in the primary Region

Complete the following steps to set up manual snapshots from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region:

  1. Create an S3 bucket in the primary Region, following the steps from earlier in this post.
  2. Associate the IAM role or user to the OpenSearch security role for taking manual snapshots in your OpenSearch Service domain in the secondary Region. For instructions, refer to the earlier section in this post.
  3. Register a snapshot repository on the OpenSearch Service domain in the secondary Region pointing to the S3 bucket in the primary Region. For instructions, refer to the earlier section in this post.
  4. Take manual snapshots of the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region, following the instructions from earlier in this post.
  5. Schedule your manual snapshot from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region at a defined interval (for example, every 1 hour) based on your RPO requirements.

Failback scenario

When the primary Region becomes available again, you can seamlessly revert to the OpenSearch Service domain in the primary Region. This failback process involves the following steps.

Destroy an existing OpenSearch Service domain in the primary Region

When the primary Region becomes available again, destroy the existing OpenSearch Service domain in the primary Region from the OpenSearch Service console. In the following screenshot, the primary Region is US East (N. Virginia).

bdb-4227-DestroyDomain

Launch a new OpenSearch Service domain in the primary Region

Create an OpenSearch Service domain in the primary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode.

Associate the IAM role or user to the OpenSearch security role for restoring manual snapshots

Follow the instructions from earlier in this post to associate the IAM role or user to the OpenSearch security role.

Register a snapshot repository on the OpenSearch Service domain

To make sure your data is available for failback, you need to register a snapshot repository on the new OpenSearch Service domain in the primary Region so that the snapshots taken from your OpenSearch Service domain in the secondary Region can be restored. Use the following command:

curl --aws-sigv4 "aws:amz:us-east-1:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

The S3 bucket should be the bucket created in the primary Region where the snapshots from your OpenSearch Service domain in the secondary Region are stored.

Restore manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain in the primary Region

To restore the manual snapshots, complete the following steps:

  1. Use the following code to restore the manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain in the primary Region:

POST /_snapshot/os-snapshot-repo/2023-11-18/_restore
{
"indices": "movie-index"
}

bdb-4227-Restore

  2. Verify data integrity and make sure the primary domain is up to date by checking the document count of the index:

GET movie-index/_count

bdb-4227-IndexCount

  3. Update Route 53 to redirect traffic to the new OpenSearch Service domain in the primary Region.
  4. Set up manual snapshots from the new OpenSearch Service domain in the primary Region to a new prefix in the S3 bucket in the secondary Region.
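The document-count check in the steps above can be scripted when several indexes are restored. The following Python sketch compares counts recorded on the secondary domain before failback with counts read from the new primary domain. It assumes each `_count` response has been parsed from JSON into a dictionary with a `count` field, and the function names are hypothetical.

```python
def parse_count(response: dict) -> int:
    """Extract the document count from a parsed _count response."""
    return int(response["count"])


def count_mismatches(expected: dict, actual: dict) -> dict:
    """Return {index: (expected, actual)} for indexes whose counts differ.

    An index that is missing from `actual` is reported with None.
    """
    return {
        index: (count, actual.get(index))
        for index, count in expected.items()
        if actual.get(index) != count
    }


if __name__ == "__main__":
    # Example numbers only.
    secondary = {"movie-index": parse_count({"count": 1500})}
    primary = {"movie-index": parse_count({"count": 1500})}
    print(count_mismatches(secondary, primary) or "all counts match")
```

Matching counts are a quick sanity check rather than proof of identical content, and counts can differ legitimately if indexing continued on the secondary domain after the snapshot was taken.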

Destroy the OpenSearch Service domain in the secondary Region

After you have successfully failed back to the OpenSearch Service domain in the primary Region, destroy the OpenSearch Service domain in the secondary Region. In the following screenshot, the secondary Region is US West (Oregon).

bdb-4227-DestroyDomain2

Conclusion

In this post, we explained how you can implement a DR pattern on OpenSearch Service using a snapshot and restore strategy. It’s highly recommended to define your RPO and RTO for your workload and choose an appropriate DR strategy. Then, using AWS services, you can design an architecture that achieves the RTO and RPO for your business needs.


About the Authors

Samir Patel is a Senior Data Architect at Amazon Web Services, where he specializes in OpenSearch, data analytics, and cutting-edge generative AI technologies. Samir works directly with enterprise customers to design and build customized solutions catered to their data analytics and cybersecurity needs. When not immersed in technical work, Samir pursues his passion for outdoor activities, including hiking, pickleball, and grilling with family and friends.

Sesha Sanjana Mylavarapu is an Associate Data Lake Consultant at AWS Professional Services. She specializes in cloud-based data management and collaborates with enterprise clients to design and implement scalable data lakes. She has a strong interest in data analytics and enjoys helping customers solve their business and technical challenges. Beyond her professional pursuits, Sanjana enjoys hiking, playing guitar, and is passionate about teaching yoga.

Vivek Gautam is a Senior Data Architect with specialization in data analytics at AWS Professional Services. He works with enterprise customers building data products, analytics platforms, streaming, and search solutions on AWS. When not building and designing data products, Vivek is a food enthusiast who also likes to explore new travel destinations and go on hikes.