AWS Big Data Blog

Best practices for configuring your Amazon OpenSearch Service domain

August 2024: This post was reviewed and updated for accuracy.

Amazon OpenSearch Service is a fully managed service that makes it easy to deploy, secure, scale, and monitor your OpenSearch cluster in the AWS Cloud. Elasticsearch and OpenSearch are distributed database solutions, which can be difficult to plan for and execute well. This post discusses some best practices for deploying Amazon OpenSearch Service domains.

The most important practice is to iterate. If you follow these best practices, you can plan for a baseline OpenSearch deployment. Amazon OpenSearch Service behaves differently for every workload—its latency and throughput are largely determined by the request mix, the requests themselves, and the data or queries that you run. There is no deterministic rule that can 100% predict how your workload will behave. Plan for time to tune and refine your deployment, monitor your domain’s behavior, and adjust accordingly.

Deploying Amazon OpenSearch Service

Whether you deploy through the AWS Management Console, with AWS CloudFormation, or via the APIs, you have a wealth of options to configure your domain's hardware, high availability, and security features. This post covers best practices for choosing your data node and dedicated master node configurations.

When you configure your Amazon OpenSearch Service domain, you choose the instance type and count for data nodes and dedicated master nodes. Elasticsearch and OpenSearch are distributed databases that run on a cluster of instances, or nodes. These node types have different functions and require different sizing. Data nodes store the data in your indexes and process indexing and query requests. Dedicated master nodes don't process these requests; they maintain the cluster state and orchestrate cluster-level changes. This post focuses on instance types. For more information about instance sizing for data nodes, see Get started with Amazon OpenSearch Service: T-shirt-size your domain. For more information about instance sizing for dedicated master nodes, see Get Started with Amazon OpenSearch Service: Use Dedicated Master Instances to Improve Cluster Stability.

Amazon OpenSearch Service supports multiple instance classes: M, R, I, C, OR1, Im4gn, and T. As a best practice, use the latest generation instance type from each instance class. As of this writing, these are the OR1, Im4gn, M6g, R6g, I3, C6g, and T3.

Choosing your instance type for data nodes

When considering data nodes, for an entry-level workload, choose the M6g. The C6g is a more specialized instance, relevant for query-heavy use cases that require more CPU work than disk or network. Use the T3 instances for development or QA workloads, but not for production deployments. For more information about how many instances to choose, and a deeper analysis of the data handling footprint, see Sizing Amazon OpenSearch Service domains.

The Graviton-based instance types for Amazon OpenSearch Service include the M6g, C6g, R6g, OR1, and Im4gn instance types. These Graviton instances are powered by AWS Graviton processors, which are custom-designed by AWS to provide the best price-performance for cloud workloads.

The Im4gn instance type was announced in 2023. It is well suited for workloads that require large amounts of dense SSD storage and high compute performance but are not especially memory intensive, which makes it a good fit for storage-intensive use cases such as log analytics over large datasets. Im4gn instances offer up to 30 TB of NVMe SSD instance storage, providing high storage density per vCPU. Im4gn instances are supported on OpenSearch, and on Elasticsearch versions 7.9 and above.

The OR1 instance type was also announced in 2023. It can provide up to a 30% price-performance improvement over existing instances and is best suited for indexing-heavy operational analytics workloads such as log analytics, observability, or security analytics. OR1 instances are designed to store large amounts of data cost-effectively. They use Amazon EBS volumes for primary storage, with data synchronously copied to Amazon S3 for increased durability.

OR1 instances support OpenSearch versions 2.11 and above. They also provide improved reliability: in the event of a failure, OR1 instances automatically recover data to the last successful operation, improving the overall reliability of your OpenSearch domain.

Choosing your instance type for dedicated master nodes

When choosing an instance type for your dedicated master nodes, keep in mind that these nodes are primarily CPU-bound, with some RAM and network demand as well. We recommend M6g.large.search master nodes for small workloads (up to about 10 data nodes), C6g.2xlarge.search instances for domains with up to 30 data nodes, R6g.xlarge.search instances for domains with up to 75 data nodes, and R6g.2xlarge.search instances for domains larger than 75 data nodes.

For the purpose of sizing dedicated master nodes, count both hot and UltraWarm nodes as data nodes.

Choosing Availability Zones

Amazon OpenSearch Service makes it easier to increase the availability of your cluster by using the Zone Awareness feature. You can choose to deploy your data and master nodes in one, two, or three Availability Zones, with or without standby.

As a best practice for your most critical workloads, you can deploy using Multi-AZ with Standby, which configures three Availability Zones, with two zones active and one acting as a standby, and two replica shards per index.

When you choose more than one Availability Zone, Amazon OpenSearch Service deploys data nodes equally across the zones and makes sure that replicas go into different zones than their primaries. Additionally, when you choose more than one Availability Zone, Amazon OpenSearch Service always deploys dedicated master nodes in three zones (if the Region supports three zones). Deploying into more than one Availability Zone gives your domain more stability and increases your availability.

OpenSearch index and shard design

When you use Amazon OpenSearch Service, you send data to indexes in your cluster. An index is like a table in a relational database. Each search document is like a row, and each JSON field is like a column.
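
For example, a single Apache web log line might be stored as a JSON document like the following. This is a minimal sketch; the index name and field names are illustrative, not a prescribed schema:

POST logs-2024-08-01/_doc
{
    "@timestamp": "2024-08-01T12:00:00Z",
    "client_ip": "203.0.113.10",
    "request": "GET /index.html HTTP/1.1",
    "status": 200,
    "bytes": 1024
}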

Amazon OpenSearch Service partitions your data into shards, routing each document to a shard based on a hash (of the document ID, by default). You must configure the shard count, and you should use the best practices in this section.

Index patterns

For log analytics use cases, you want to control the life cycle of data in your cluster. You can do this with a rolling index pattern. Each day, you create a new index, then archive and delete the oldest index in the cluster. You define a retention period that controls how many days (indexes) of data you keep in the domain based on your analysis needs. For more information, see Index State Management.
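
You can automate this retention with an Index State Management (ISM) policy. The following is a minimal sketch, assuming an OpenSearch domain (the API path differs on legacy Elasticsearch versions) and daily indexes that match logs-*; the policy name and 7-day retention are illustrative:

PUT _plugins/_ism/policies/delete-logs-after-7d
{
    "policy": {
        "description": "Delete rolling log indexes 7 days after creation",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "7d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": {
            "index_patterns": ["logs-*"],
            "priority": 100
        }
    }
}

The ism_template block attaches the policy automatically to new indexes that match the pattern, so each day's index is removed once it ages past the retention period.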

Setting your shard counts

There are two types of shards: primary and replica. The primary shard count defines how many partitions of data Amazon OpenSearch Service creates. The replica count specifies how many additional copies of the primary shards it creates. You set the primary shard count at index creation and you can't change it afterward (the _shrink and _split APIs can, but using them on clusters under load at scale isn't recommended). You also set the replica count at index creation, but you can change it on the fly, and Amazon OpenSearch Service adjusts accordingly by creating or removing replicas.
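
For example, the following sketch raises the replica count of an existing index to 2 with the update settings API (the index name is illustrative):

PUT logs-2024-08-01/_settings
{
    "index": {
        "number_of_replicas": 2
    }
}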

You can set the primary and replica shard counts when you create the index manually, with a create index request. A better way for most use cases is to create and maintain an index template. See the following code:

PUT _template/<template_name>
{
    "index_patterns": ["logs-*"],
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1
    },
    "mappings": {
        …
    }
}

When you set a template like this, every index that matches the index_pattern has the settings and the mapping (if you specify one) applied to that index. This gives you a convenient way of managing your shard strategy for rolling indexes. If you change your template, you get your new shard count in the next indexing cycle.

You should set the number_of_shards based on your source data size, using the following guideline: primary shard count = (daily source data in bytes * 1.25) / 50 GB.

For search use cases, where you’re not using rolling indexes, use 30 GB as the divisor, targeting 30 GB shards. However, these are guidelines. Always test with your own data, indexing, and queries to find your optimal shard size.

It is important to align your shard and instance counts so that your shards distribute equally across your nodes. You do this by adjusting shard counts or data node counts so that they are evenly divisible. For example, an index with 5 primary shards and 1 replica has a total of 10 shards. You can get even distribution by choosing 2, 5, or 10 data nodes. Although it's important to distribute your workload evenly on your data nodes, it's not always possible to get every index deployed evenly. Use the shard size as the primary guide for shard count and make small (< 20%) adjustments, generally favoring more instances or smaller shards, based on even distribution.
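
One quick way to check the distribution is with the _cat APIs, which report how many shards each data node holds and where the shards of a given index pattern landed (logs-* is illustrative):

GET _cat/allocation?v

GET _cat/shards/logs-*?v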

Determining storage size

So far, you’ve mapped out a shard count, based on the storage needed. Now you need to make sure that you have sufficient storage and CPU resources to process your requests. First, find your overall storage need: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention.

You multiply your unreplicated index size by the number of replicas and days of retention to determine the total storage needed. Each replica adds an additional storage need equal to the primary storage size. You add this again for every day you want to retain data in the cluster. For search use cases, set the number of days of retention to 1.

The total storage need drives a minimum instance type and instance count, based on the maximum storage each instance provides. If you're using EBS-backed instances like the M6g or R6g, you can deploy EBS volumes up to the supported EBS volume size quotas.

For instances with ephemeral storage, storage is limited by the instance type (for example, im4gn.8xlarge.search has 15 TB of attached storage). If you choose EBS, you should use the General Purpose (gp3) volume type. Although the service does support the io1 volume type and provisioned IOPS, you generally don't need them. Use provisioned IOPS only in special circumstances, when metrics support it.

Take the total storage needed and divide by the maximum storage per instance of your chosen instance type to get the minimum instance count.

After you have an instance type and count, make sure you have sufficient vCPUs to process your requests. Multiply the instance count by the vCPUs that instance provides. This gives you a total count of vCPUs in the cluster. As an initial scale point, make sure that your vCPU count is 1.5 times your active shard count. An active shard is any shard, primary or replica, for an index that is receiving substantial writes. For log analytics, only the current indexes are active. For search use cases, which are read heavy, use the primary shard count.

Although 1.5 is recommended, this is highly workload-dependent. Be sure to test and monitor CPU utilization and scale accordingly.
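
In production, watch the domain's CPUUtilization metric in Amazon CloudWatch; for a quick point-in-time view, you can also query the nodes directly. A minimal sketch:

GET _cat/nodes?v&h=name,node.role,cpu,heap.percent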

As you work with shard and instance counts, bear in mind that Amazon OpenSearch Service works best when the total shard count is as small as possible; fewer than 10,000 is a good soft limit. Each instance should also have no more than 25 shards per GB of JVM heap on that instance. For example, the R6g.xlarge has 32 GB of RAM total. The service allocates half the RAM (16 GB) for the heap (the maximum heap size for any instance is 31.5 GB). You should therefore never have more than 16 * 25 = 400 shards on any node in that cluster.
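
To see how close a domain is to these limits, you can pull the total shard counts from the cluster health API; the filter_path parameter below simply trims the response to the fields of interest:

GET _cluster/health?filter_path=status,number_of_data_nodes,active_primary_shards,active_shards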

Use case

Assume you have a log analytics workload supporting Apache web logs (500 GB/day) and syslogs (500 GB/day), retained for 7 days. This post focuses on the R6g instance type as the best choice for log analytics. You use a three-Availability Zone deployment, one primary and two replicas per index. With a three-zone deployment, you have to deploy nodes in multiples of three, which drives instance count and, to some extent, shard count.

The primary shard count for each index is (500 * 1.25) / 50 GB = 12.5 shards, which you round up to 15. Using 15 primaries allows additional space to grow in each shard and is divisible by three (the number of Availability Zones, and therefore the number of instances, is a multiple of 3). The total storage needed is 1,000 * 1.25 * 3 * 7 = 26.25 TB. You can provide that storage with 18x R6g.xlarge.search, 9x R6g.2xlarge.search, or 6x R6g.4xlarge.search instances (based on EBS limits of 1.5 TB, 3 TB, and 6 TB, respectively). You should pick the 4xlarge instances, on the general guideline that vertical scaling usually performs better than horizontal scaling (there are many exceptions to this general rule, so make sure to iterate appropriately).

Having found a minimum deployment, you now need to validate the vCPU count. Each index has 15 primary shards and 2 replicas, for a total of 45 shards. The most recent indexes receive substantial writes, so each has 45 active shards, giving a total of 90 active shards. You can ignore the other 6 days of indexes because they are infrequently accessed. For log analytics, you can assume that your read volume is always low and drops off as the data ages. Each R6g.4xlarge.search has 16 vCPUs, for a total of 96 in your cluster. The best practice guideline calls for 90 * 1.5 = 135 vCPUs. As a starting scale point, you need to increase to 9x R6g.4xlarge.search, with 144 vCPUs. Again, testing may reveal that you're over-provisioned (which is likely), and you may be able to reduce to six. Finally, given your data node and shard counts, provision 3x M6g.large.search dedicated master nodes.

Conclusion

This post covered some of the core best practices for deploying your Amazon OpenSearch Service domain. These guidelines give you a reasonable estimate of the number and type of data nodes. Stay tuned for subsequent posts that cover best practices for deploying secure domains, monitoring your domain’s performance, and ingesting data into your domain.


About the Authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Amazon OpenSearch Service teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Nikhil Agarwal is Sr. Technical Manager with Amazon Web Services. He is passionate about helping customers achieve operational excellence in their cloud journey and working actively on technical solutions. He is also AI/ML enthusiastic and deep dives into customer’s ML-specific use cases. Outside of work, he enjoys traveling with family and exploring different gadgets.

Gene Alpert is a Senior Analytics Specialist with AWS Enterprise Support. He has been focused on our Amazon OpenSearch Service customers and ecosystem for the past three years. Gene joined AWS in 2017. Outside of work he enjoys mountain biking, traveling, and playing Population:One in VR.


Audit History

Last reviewed and updated in August 2024 by Nikhil Agarwal | Sr. Technical Manager and Gene Alpert | Sr. Analytics Specialist.