AWS Database Blog

Use Amazon S3 to Store a Single Amazon Elasticsearch Service Index

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

As detailed in our documentation, you can use the Elasticsearch API actions in Amazon Elasticsearch Service to take manual snapshots of your domain. You can easily back up your entire domain this way. However, did you know you can also snapshot and restore a single index, or multiple indexes? This blog post walks you through backing up and restoring a single index by using an Amazon S3 bucket.

Note: This blog post uses an Amazon Elasticsearch Service (Amazon ES) version 5.3 domain.

If you’re running a log analytics workload, use this technique to move older indexes off your cluster, retaining them in S3 for future use. You’ll save on cost but still be able to retrieve and explore the data. You can also use this technique to migrate an index from one Amazon ES domain to another for version upgrades, or to copy an index to another AWS Region and deploy it there.

Set up Amazon S3 and AWS Identity and Access Management (IAM)
The first thing you need to do is create a bucket in Amazon S3. I named my bucket es-s3-repository.

Now you need to create an IAM role with a policy that allows Amazon ES to write to your bucket. I describe the steps below, but for more details, see the IAM documentation.

To create this IAM policy, open the IAM console, switch to the Policies tab, and choose Create Policy. Select Create Your Own Policy, and give your policy a name (I named mine es-s3-repository).

Paste the following policy document in the provided text box on the console. Make sure to substitute the name of your bucket where I have es-s3-repository. This policy document grants list, get, put, and delete object permissions to whoever assumes the role to which it is attached. When you’re done, choose Create Policy.

{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Action": [ "s3:ListBucket" ],
            "Effect": "Allow",
            "Resource": [ "arn:aws:s3:::es-s3-repository" ]
        },
        {
            "Action": ["s3:GetObject",
                       "s3:PutObject",
                       "s3:DeleteObject"
                      ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::es-s3-repository/*"
            ]
        }
    ]
}
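Because the bucket name appears in two ARNs, you might prefer to generate the policy rather than edit it by hand. The following is a minimal Python sketch, not part of the original walkthrough. The function names build_repository_policy and policy_json are hypothetical, and the sketch covers only the list, get, put, and delete permissions described above.

```python
import json


def build_repository_policy(bucket: str) -> dict:
    """Return an IAM policy document for an S3 snapshot repository bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Action": ["s3:ListBucket"],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions apply to the keys inside the bucket.
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def policy_json(bucket: str) -> str:
    """Serialize the policy so it can be pasted into the IAM console."""
    return json.dumps(build_repository_policy(bucket), indent=4)
```

Calling print(policy_json("es-s3-repository")) prints a document you can paste into the console text box.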

You can use the AWS Command Line Interface (AWS CLI) to create the role to which you will attach this policy. If you haven’t already, install the AWS CLI.

Next, run the following command to create the new role. The role name here is es-s3-repository; you can substitute a name of your own.

aws iam create-role --role-name es-s3-repository --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Sid": "", "Effect": "Allow", "Principal": {"Service": "es.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
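Hand-quoting JSON inside a shell command is easy to get wrong. As a hedged alternative, this Python sketch (assuming Python 3.8 or later for shlex.join) builds the same trust policy as a dictionary and prints an equivalent, safely quoted command line. The helper names trust_policy and create_role_command are hypothetical.

```python
import json
import shlex


def trust_policy(service: str = "es.amazonaws.com") -> dict:
    """Trust policy that lets the named service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role_command(role_name: str) -> str:
    """Return a shell-quoted 'aws iam create-role' command for role_name."""
    return shlex.join([
        "aws", "iam", "create-role",
        "--role-name", role_name,
        "--assume-role-policy-document", json.dumps(trust_policy()),
    ])
```

Running print(create_role_command("es-s3-repository")) gives a command you can paste into your terminal.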

Return to the IAM console and choose the Roles tab. If you don’t see your role in the list, refresh the list. Choose the role you just created. Copy the role’s Amazon Resource Name (ARN); you’ll need it in a minute. Then choose Attach Policy. Find the policy you created earlier, select the check box next to it, and then choose Attach Policy again.

You now have a role that Amazon ES can assume, with a policy that allows Amazon ES to read from and write to your S3 repository bucket.

Set up a repository in Amazon ES
You use Elasticsearch’s _snapshot API action to register a repository with Amazon ES. To call this action, you need a client that can send requests to your domain’s endpoint.

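My domains are set up for user-based access control, requiring me to sign my requests with AWS SigV4 signing. To simplify ad hoc access to my cluster, I use a signing proxy, available on GitHub. The identity that signs the request also needs permission to pass the role you created to Amazon ES (iam:PassRole on the role’s ARN).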

Note: This signing proxy was developed by a third party, not AWS. AWS is not responsible for the functioning or suitability of external content. This proxy is great for development and testing, but is not suitable for production workloads.

Download and install the proxy, then point it at your domain. You can then use curl to send requests to localhost:9200.

curl -XPUT 'http://localhost:9200/_snapshot/snapshot-repository' -d'{
    "type": "s3",
    "settings": {
        "bucket": "es-s3-repository",
        "region": "us-west-2",
        "role_arn": "arn:aws:iam::123456789012:role/es-s3-repository"
    }
}'

Replace snapshot-repository with your own repository name. Replace es-s3-repository, us-west-2, and arn:aws:iam::123456789012:role/es-s3-repository with your S3 bucket’s and role’s details.

Note: You must use a PUT REST call to register the repository.
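If you would rather script the registration than type the curl command, the following Python sketch builds the same PUT request with the standard library. It assumes the signing proxy is listening on localhost:9200, as in the curl example. The function names register_repository_request and send are hypothetical.

```python
import json
import urllib.request


def register_repository_request(endpoint, repository, bucket, region, role_arn):
    """Build (but do not send) the PUT request that registers an S3 repository."""
    body = {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }
    return urllib.request.Request(
        url=f"{endpoint}/_snapshot/{repository}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, pass your own values, for example send(register_repository_request("http://localhost:9200", "snapshot-repository", "es-s3-repository", "us-west-2", "arn:aws:iam::123456789012:role/es-s3-repository")). Building the request separately from sending it lets you inspect the URL and body before anything reaches your domain.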

Back up an index
Once you have your repository set up, you use the _snapshot API action to create a snapshot of your index. By default, Amazon ES snapshots the entire cluster’s indexes. However, you can alter this behavior with the “indices” setting.

curl -XPUT localhost:9200/_snapshot/snapshot-repository/snapshot_1 -d'{
  "indices": "movies",
  "ignore_unavailable": true,
  "include_global_state": false
}'

You can specify one or more indexes for backup in this snapshot. To ignore errors when any of the named indexes is unavailable, set ignore_unavailable to true. You also might want to store the cluster’s global state. However, for index-level backups, doing this usually won’t make sense, so set include_global_state to false.
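To back up several indexes in one snapshot, pass their names as a single comma-separated string in the indices setting. Here is a minimal Python sketch that builds that request body from a list of names. The function name snapshot_body is hypothetical, and its defaults mirror the settings discussed above.

```python
import json


def snapshot_body(indices, ignore_unavailable=True, include_global_state=False):
    """Build the JSON body for a snapshot limited to the given indexes."""
    if isinstance(indices, str):
        indices = [indices]
    if not indices:
        raise ValueError("name at least one index to snapshot")
    return json.dumps({
        # Multiple index names are passed as one comma-separated string.
        "indices": ",".join(indices),
        "ignore_unavailable": ignore_unavailable,
        "include_global_state": include_global_state,
    })
```

For example, snapshot_body(["movies", "books"]) produces a body whose indices value is "movies,books", which you can send with the PUT call to the _snapshot API action.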

You can also use the _snapshot API action to check the status of your snapshot or get a list of all snapshots in the repository.

curl -XGET localhost:9200/_snapshot/snapshot-repository/_status
curl -XGET localhost:9200/_snapshot/snapshot-repository/_all
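Snapshots run in the background, so a script usually needs to wait for one to finish before deleting anything. The sketch below polls until the snapshot is no longer IN_PROGRESS. It assumes the response is a JSON object with a snapshots list whose entries carry snapshot and state fields, which is the shape returned when you list snapshots in a repository. The names snapshot_state, wait_for_snapshot, and fetch are hypothetical.

```python
import time


def snapshot_state(response, snapshot_name):
    """Extract one snapshot's state from a parsed snapshot-listing response."""
    for snapshot in response.get("snapshots", []):
        if snapshot.get("snapshot") == snapshot_name:
            return snapshot.get("state")
    return None


def wait_for_snapshot(fetch, snapshot_name, poll_seconds=10.0, max_polls=60):
    """Poll until the snapshot leaves IN_PROGRESS and return its final state.

    `fetch` is a caller-supplied function that performs the GET request and
    returns the parsed JSON response as a dict.
    """
    for _ in range(max_polls):
        state = snapshot_state(fetch(), snapshot_name)
        if state is not None and state != "IN_PROGRESS":
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"snapshot {snapshot_name} still running after {max_polls} polls"
    )
```

Treat any final state other than SUCCESS (for example FAILED or PARTIAL) as a reason to keep the source index in place.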

Restore your index to the same cluster

You can restore a single index from a snapshot. To restore, you must ensure that no open index with the same name exists on the cluster. This is normally the case if you are reloading old data that you already removed from the cluster. If an index with that name is open, delete the existing index first, then restore your backup.

curl -XPOST localhost:9200/_snapshot/snapshot-repository/snapshot_1/_restore -d'{
  "indices": "movies",
  "ignore_unavailable": false,
  "include_global_state": false
}'

When you restore, you specify the snapshot, and also any indexes that are in the snapshot. You do this in much the same way as when you took the snapshot in the first place.
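If you want to keep the live index and restore the snapshot copy next to it, the restore call also accepts the rename_pattern and rename_replacement settings. The following Python sketch builds a restore body that adds a prefix to every restored index. The function name restore_body and the prefix restored_ are hypothetical choices for illustration.

```python
import json


def restore_body(indices, rename_prefix=None):
    """Build the JSON body for restoring selected indexes from a snapshot.

    When rename_prefix is set, every restored index gets that prefix, so the
    restore does not collide with an open index of the same name.
    """
    body = {
        "indices": ",".join(indices),
        "ignore_unavailable": False,
        "include_global_state": False,
    }
    if rename_prefix:
        # rename_pattern is a regular expression matched against each index
        # name; $1 in rename_replacement refers to its first capture group.
        body["rename_pattern"] = "(.+)"
        body["rename_replacement"] = f"{rename_prefix}$1"
    return json.dumps(body)
```

With restore_body(["movies"], rename_prefix="restored_"), the movies index from the snapshot should come back as restored_movies, leaving the existing movies index untouched.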

Where to go from here

By snapshotting a single index, you give yourself flexibility in where and when you restore the data.

For time-based, streaming data, you use a rolling set of indexes. Usually, you create a new index for every day. You manage your storage and resource usage by limiting the number of days that you retain data, dropping the oldest index every day. Before you drop the index, back it up to S3. This backup lets you restore it in the future when you need to look further back than what you have in your “hot cluster.”
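To automate this rotation, you first need to know which daily indexes have aged out of your retention window. The Python sketch below picks them from a list of index names. It assumes your daily indexes are named with a fixed prefix followed by a YYYY.MM.DD date (for example logs-2017.06.01). The prefix and the function name indices_to_archive are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def indices_to_archive(
    index_names: Iterable[str], prefix: str, retain_days: int, today: date
) -> List[str]:
    """Return daily indexes older than the retention window, oldest first.

    Assumes names of the form <prefix>YYYY.MM.DD; other names are ignored.
    """
    cutoff = today - timedelta(days=retain_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index in the assumed format
        if day < cutoff:
            expired.append((day, name))
    return [name for _, name in sorted(expired)]
```

For each name this returns, take a snapshot of that index, confirm that the snapshot succeeded, and only then delete the index from the cluster.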

In all cases, you can use single-index (or full-cluster) backups to migrate your data between AWS Regions, between clusters, and even between Amazon ES versions (compatibility permitting).


About the Author

Jon Handler (@_searchgeek) is an AWS solutions architect specializing in search technologies. He works with our customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.