AWS Database Blog
Benchmarking Amazon Aurora Limitless with pgbench
Amazon Aurora Limitless is a database solution that grows and shrinks vertically and horizontally with the current workload requirements. It’s a managed solution based on the Amazon Aurora infrastructure, and isn’t constrained by a single node’s limitations.
In this post, we show you how to test performance with the common tool pgbench. This tool is used with single-node database management systems (DBMS) and is optimized for single-node use cases. As we shall see in this post, this doesn’t mean that the tool measure what we think when it comes to multi-node systems. We demonstrate how it works with Aurora Limitless. We also discuss the obstacles and opportunities you might encounter when using this tool with Aurora Limitless.
Performance and scalability
Your organization might measure performance based on the following traits:
- Latency– There are statements, specifically reads and writes, that must complete within a period of time.
- Transaction rate– There is a need for a minimum transactions per second, per minute, per hour, or per day, or a peak rate within a period of time.
- Scalability– There is a need to grow without built-in limitations.
Typically, performance is measured based on all three of these traits. What use is a high transaction rate unless users and applications get a response in a reasonable period of time? What use is good latency if it’s only when there are very few users? Scalability involves many moving parts and the purpose of scalability tests is often to measure the limit of immovable parts.
Scalability points to a larger picture that includes architecture. The different metrics, such as latency, can be a priority or a requirement.
Scalability should be tested well before applications outgrow the current solution. Tests give two important properties related to scalability: how the system behaves under more stress than with the real-world applications, and at what workload the system reaches a limit.
The following are some things to consider when testing performance:
- Workload– The more complex the workload, the harder it usually is to run the workload, and understand the result. In contrast, the smaller the workload, the smaller the surface area tested. However, with simplicity comes ease of understanding.
- Schema– The larger the schema, the more time required to set up, identify the workload, load, and reset. The smaller the schema, the more limited the choices for your workload, but the more precise the measurements and the stress points.
- Scalability – This can be defined in various terms: number of users, number of connections, amounts of data, transaction rate, and maximum transaction rate. It’s important to focus on one or very few aspects of scalability early on.
- Baseline– Lastly, you need a baseline to compare with. For example, consider a car that does 60 mph in 200 yards (or 100 kph in 200 meters): Is that good? Is that bad? To contrast that, many of us can assess how a car that does 60 mph in 6 seconds compares.
Determining the purpose of the exercise, schema, workload, scalability, and baseline are all important questions to ask before starting a scalability benchmark.
When scaling up, it’s important to understand what parameters are important to the business, the applications, and the users. It’s also important to understand what affects performance in the toolset and the architecture. A tool often used with PostgreSQL is pgbench
. In the following sections, we explore using pgbench
with Aurora Limitless.
Aurora Limitless terminology
In this post, we use the following terminology:
- Sharding – A strategy to divide data over multiple nodes.
- Shard key – The column or columns that data is divided along.
- Collocation – Declaring that tables’ rows should be collocated to the same shards. Collocation speeds up joins between collocated tables.
- Reference table – A type of table that exists in all shards with low Data Manipulation Language (DML) rates and typically static.
- Aurora Capacity Units (ACUs) – A metric for the resources dedicated to a shard group. For faster responsiveness, increase the minimum ACU, and for a higher capacity ceiling, increase the maximum ACU.
- Shard group – The collection of nodes in an Aurora Limitless system. The number changes with workload, data, and ACU setting.
- Router – The nodes that receive requests and either run them or push them down to shards. The latter performs and scales better.
- Shard – The nodes that stores and manages the data. They’re only accessible from the shard group’s routers.
- Pushdown – The fastest and most scalable way to run statements in a sharded system. The term refers to that execution is “pushed down” to the shards. These statements have equality predicates on all tables’ shard keys, the tables have the same shard keys and for joins these tables are collocated. Reference tables may be included in the statements.
pgbench
pgbench
is a popular tool for testing performance with PostgreSQL. The following is example code from a first run with Aurora Limitless:
pgbench
generates data on the client and then writes to the DBMS:
This doesn’t apply to all versions of pgbench
; some versions are optimized for loading for single-node PostgreSQL by employing COPY...FREEZE
. That optimization fails due to a documented limitation with partitioned tables. For more information, see the official documentation on the COPY
command.
Even thought the load fails the tables are created. Let’s see how they’re defined:
Let’s compare that with a pgbench
database created for Aurora Limitless:
The important difference is the column table_type
in the preceding code examples. In the first case, the type is standard
and in the second case, the type is sharded
. In the next section, we discuss what sharding is and how to create sharded tables.
Sharding
Sharding works by dividing up data into different chunks called shards. Each shard is then stored in a separate instance. This allows for greater scalability because reads and writes go to multiple instances. The strategy can be employed for different purposes and, because Aurora Limitless is an online transaction processing (OLTP) solution, the aim is to allow for a higher number of reads and writes. We declare a shard key and submit workloads, and Aurora Limitless divides the data and routes the statements.
Because Aurora Limitless must be able to tell how to route statements, we need to declare a shard key. The key, comprised of one or more columns, is used to determine in which instance rows with particular values are stored (or, in the case of INSERT, where to store them). Tables with the same shard key should be declared as collocated, so rows in different tables but with the same shard key are stored in the same shard.
Sharding and pgbench
The pgbench
default schema is inspired by the TPC-B benchmark, simulating a banking system in four tables: branches
, tellers
, accounts
, and history
, with many tellers per branch, and many accounts per branch. The primary key follows the tables: branch_id
, teller_id
, and account_id
. The relationships are modeled by having branch_id
as foreign keys in tellers
and in accounts
. Each transaction is inserted into the history table.
Typically, a combination of the schema and transactions provides the shard key. In this case, the clear choice is branch_id
because it’s in all four tables.
To shard tables during creation, we set the table mode as sharded and the shard key for only one column, branch_id
or bid
.
The branch ID is the primary key for the table pgbench_branches
, and we want rows from all sharded tables to be collocated with the rows in pgbench_branches
, all having the same value for shard key bid
.
This last option goes after the table pgbench_branches
is created, and therefore Aurora Limitless knows that the tables are collocated, which makes joins more efficient. The following snippet shows the runnable sequence of commands to create sharded and collocated tables:
After sharding, a SELECT from rds_aurora.limitless_tables
will show them as sharded:
The system view rds_aurora.limitless_table_collocations
shows collocated tables:
Creating indexes
Indexes can be created before or after the tables are loaded. Indexes are local, so for sharded tables, unique indexes must include the shard key. This goes for both unique and primary key constraints. This affects what workloads—statements and transactions—are run efficiently.
For the actual command, this means making a minor change to add a constraint. Note that pgbench_history
is not and usually should not be indexed:
A common best practice is that the more selective column is the first column in the index. Because there are 10 tellers per branch, and 100,000 accounts per branch, bid
is less selective than tid
and aid
.
Loading data
Before we resolve loading pgbench
data, let’s look at an example of the statements issued by pgbench
. First, let’s explore what the Aurora Limitless implementation of pg_stat_statements
says about the data load:
As we can see, the INSERT statement has a number—100—of row constructors, each with four placeholders. (For readability, the ellipses replace a number of row constructors and we will return to the column sso_calls
.) This is a common design pattern to improve performance of a single-node DBMS such as PostgreSQL. For Aurora Limitless, a different design pattern will improve performance, particularly scalability.
Design pattern for application insert and loading
There are different scenarios where inserts or data loads are optimized. This section discusses how applications insert data, not the initial load or bulk operations.
For single nodes, the common design pattern is to send a large number of rows in one step, as shown in the preceding statements. This design pattern and its gains don’t manifest in Aurora Limitless because each row must be evaluated. Aurora Limitless need to unpack the rows to determine to which instance the INSERT statement should be routed. Then it transmits each statement to the right instance while maintaining transactional consistency. This is the reason why pgbench
load is slow compared to a single-node DBMS.
For Aurora Limitless, it’s better to use many connections. Each connection submits one row to insert at a time. Then each row gets pushed to the instance where the row should reside. This can be 10, 100, or 1,000 connections—Aurora Limitless scales according to need.
Loading a pgbench database
There are several ways of loading pgbench
data, some faster, some more time-consuming. A simple way of loading pgbench
data in Aurora Limitless uses SQL and SQL functions. The scale
parameter sets the proportions of the tables, so scale 10000
means that pgbench_branches
will have 10,000 rows, pgbench_tellers
will have 100,000 rows, and pgbench_accounts
will have 1 billion rows. The following script loads the three main tables (pgbench_history
starts empty):
This results in the following:
Tables are created, sharded, indexed, and loaded. Then it’s time for vacuum:
A comment on scale
Scale 10,000 means that pgbench_branches
has 10,000 rows. This places a limit on the number of transactions since there will be concurrency issues. So, chose scaling factor advisedly taking the target throughput into account, and monitor wait events.
Running pgbench
We have created the tables, sharded, and collocated them. The tables are indexed and the indexes include the shard key. The data is loaded and the tables are vacuumed. Now we’re ready to try pgbench
on Aurora Limitless. The following code illustrates a 10-minute run with 100 users:
We can now determine if this is what we expected from pgbench
and Aurora Limitless.
Sharding and workload changes
The result from our first run isn’t what we expected. Let’s see what happened.
The transaction (singular, because there is only one type of transaction) is as follows:
In the following code, we can see how this transaction is reflected in rds_aurora.limitless_stat_statements
:
Checking the column calls
, we can see that the statements are run the same number of times. This is as expected.
But what about the column sso_calls
, and what does that mean?
Statements can be executed on the router or pushed down to a shard for execution. The former is a default strategy that can be thought of as bringing data to the router and then running the operator. The latter performs much better by moving the processing to the data, and there are numerous optimizations around single-shard optimized execution, which is what sso_calls
means.
For one update, the update on pgbench_branches
, we can see that for all executions – the column calls
are single-shard optimized because the values in calls
are equal to the number in sso_calls
. The other two UPDATE statements and the SELECT are done by sending the statements to all shards.
We can achieve single-shard optimized execution including the shard key in the WHERE
clause. This is true for pgbench_branches
because bid
is both the shard key and the table’s primary key.
As we can see in the preceding UPDATE statements on the tables, pgbench_tellers
and pgbench_accounts
don’t include predicates on the shard key. Therefore, they won’t be executed as efficiently as they could be.
Aurora Limitless can’t push statements down to the data unless it knows the shard key. Therefore, we need to check the workload and include an equality predicate on the shard key.
For pgbench
, we need to make changes to the actual workload—in this case, add a predicate on bid
to the UPDATE statements on pgbench_accounts
and pgbench_tellers
. The following is the original statement on table pgbench_accounts
:
In pgbench
scripts, this is straightforward to change, especially because bid
in the variable :bid
is already generated by the driver:
The complete script for running a pgbench
standard workload on Aurora Limitless adds an extra predicate on the shard key:
Before we see the difference between the original and the modified versions, let’s consider the impact on real-world applications.
Design patterns for sharded databases
For real-world applications, this is sometimes more complex; the following are some examples based on the common customer
, order
, and order_item
table structure.
In this case, customer_id
is the primary key in customer
and a foreign key in order
. Declaring customer_id
as a shard key in both tables is clear. This is coupled with the fact that the fictive workload includes predicate on customer ID. To include joins with order_item
, we need the shard key. The common approach is to denormalize order_item
by adding the customer_id
to that table. It’s also necessary to change the application’s read and write operations to order_item
.
Let’s include another factor in our example: a product structure. We use the tables product
, product_supplier
, product_category
, product_description
, and product_attributes
. These are different entities describing a different area than the customer-oriented tables, but at the same time are used by the application together with orders. Here we bring out a new approach: reference tables.
Shared tables are divided into chunks and the reference tables are not. They still exist on every shard node populated with the rows. This means that DML statements must go to the instances, making reference tables less suited for high-performance DML. Joins between a single shard and a reference table can be pushed down to that shard for highest performance.
In our example, the product structure isn’t updated frequently. We make the joins between orderline and product more efficient by populating a reference table daily with the required data from the product tables.
Running pgbench on the modified workload
Now that we better understand workloads and how statements can be executed efficiently through pushdown, we’re in a better place for another run. We have made these changes:
- The tables are sharded with the branch ID as the shard key
- Loading is complete
- Indexes are created and include the tables’ shard keys
- The workload has been changed to include a predicate on the shard key
The following is output from a pgbench
run with the modified workload:
We see an improvement from 518 TPS to 1,538 TPS. However, as we can see from the latency something isn’t right. Let’s take the next step to improve performance.
Testing guidelines for Aurora Limitless
The process of testing performance requires several best practices to provide reliable results. One best practice is to run the workload for a sufficient amount of time. The main reason for that is to make sure the database is warm and that pages are read from storage into memory to minimize unnecessary reads during the steady state. Another best practice is to reset the database to its initial state and to vacuum. The following code is a sample vacuum that also resets the tables:
The best practice is to warm up and run workloads for a long enough amount of time. This makes sure that as much data is in cache as possible. In the case of Aurora Limitless, there is another reason why we should follow this best practice.
Aurora Limitless grows and shrinks horizontally and vertically. A key component to the horizontal scaling is Amazon Aurora Serverless. A main characteristic with Serverless is that it scales up with higher workloads and scales down with lower workloads. Therefore, we need to make sure the buffer cache is warm, and allow Aurora Serverless to scale instances up.
Let’s see what happens if we increase the runtime from 10 minutes to 1 hour. The following is the pgbench
command and output:
In this example, pgbench
is run for 3,600 seconds (1 hour). In the output, we can see that thanks to prior runs and vacuum, it starts off at 320 TPS and gradually increases. It ends up over 2,400 TPS with a latency of 30 milliseconds. The average over 1 hour is 2,042 TPS and 49 milliseconds.
The growth in TPS reflects how Aurora Limitless responds to the workload by increasing the ACUs.
The following graph from Amazon CloudWatch shows how the shard group’s DBShardGroupCapacity
grows and shrinks.
We see the upwards slope, where Aurora Serverless is adding resources, among them CPU and memory. There is a steady state where the shard group has scaled up fully to the maximum ACU (where it plateaus).
ACU is a metric that tracks numerous resources, including CPU and memory. It’s tightly integrated into Aurora, including Amazon Aurora PostgreSQL-Compatible Edition and its memory management.
We set the maximum ACU when we create the shard group and can modify it as needed. It’s the maximum amount of resources available to the shard group. When the system maxes out, we can manually increase the maximum ACU to provide additional resources to the shard group.
Minimum ACU improves based on how quickly more resources are provided to the shard group, but at the same time incurs higher costs.
Topology
With the current ACU setting, the maximum ACU is close to 2,500 TPS, as shown in the following snippet from the previous run:
Aurora Limitless is based on two types of instances: routers and shards. Applications connect to routers through a DNS service. The purpose of routers is to distribute workloads by routing statements to shards and to execute statements that span multiple shards. Shards store and access data, executing the statements that the router has passed on. How many routers and how many shards depends on the shard group’s ACU setting. For our shard group, there are two routers:
The following graph shows the CPUUtilization
from CloudWatch for the two routers during the run with the modified workload.
The yellow line for one of the two routers is significant. The blue line for the other router is at the X-axis throughout the test, using almost no CPU. Therefore, when pgbench
connected to the endpoint, the connections were routed to the same router.
What we observe in the graph is that one of the routers is busy while the other router is more or less idle. This is a phenomenon that affects benchmarking and is less of an issue in the real world. In benchmarking, the tool (in this case pgbench
), is built to quickly establish connections. Therefore, connections get the same endpoint, to the same router.
For the most part, this isn’t an issue in real-world application. The exception would be where an application server requests many connections (hundreds or thousands) as fast as pgbench
and at almost the same time, for instances during startup. A slightly lower rate of connects would offset this, as would a Network Load Balancer.
To take full advantage of the shard group when benchmarking with tools such as pgbench
, run one pgbench
per router, each pgbench
connecting to a different endpoint. The following query shows the router endpoints:
Start one pgbench
session for each router, in this case totaling 1,200 users:
With the two pgbench
sessions running concurrently, this is a workload for 1,200 users totaling well over 7,000 TPS, counting the later reported lines in the output above. The averages over the entire 1-hour run is lower due to ramp-up.
Limitless pgbench
The focus of this post has largely been on pgbench
functionality with Aurora Limitless. Discussions have been pgbench specific and have described the impact on real-world applications. Let’s shift focus to high performance and for that purpose use a new shard group.
This test requires no particular configuration or settings in parameter groups, but it does require:
- significantly higher maxACU
- large driver instances
- balancing driver connections over all routers.
At the time of GA, maxACU is limited to 6,144 ACU and to increase the limit it takes a discussion with AWS. In the case the maxACU is set well above the GA limit and the minACU was increased to shorten the warmup time.
To drive the workload three c8g.48xlarge
instances were set up.
There are several options to balance connections over all routers. An earlier example in this post describes how to use one pgbench
session per router. A far better alternative is to use an NLB, which can be leveraged to not only balance connections but also to improve on availability. To demonstrate yet another alternative, let’s use functionality built into libpq
and pgbench
versions 16: load balance hosts.
To use the built in load balancing assign the list of router endpoints (as a comma separated list) to PGHOST
and to set PGLOADBALANCEHOSTS
to random:
This approach does not guarantee a perfect distribution of connections, but a randomized distribution. To double check use CloudWatch and the router metric DatabaseConnections
. This graph is from the test in this section of the post.
This graph shows the 163 to 217 connections per router. With a total of 6,000 connections over 32 routers a perfect distribution would have been 187.5 connections per router. So not a perfect distribution but this example shows how to use a simple way of distributing connections.
Lastly, for the sake of simplicity, lets use one of the tables from earlier with a simpler workload, updates of random rows. The table is now 100B rows, that is, scale 1,000,000, and here is the workload:
The command line is straightforward for 2,000 users per pgbench
session, running for ten minutes. PGHOST
and PGLOADBALANCEHOSTS
are set in the environment:
This is the complete run in each of the three driver instances:
Driver instance 1:
Driver instance 2:
Driver instance 3:
Summing up the transactions per second results in 2555636.945298 or >2.5M TPS. This at a latency of 2.35 ms with a latency standard deviation of less than half the latency.
Summary of pgbench
In this section, we explored how to benchmark Aurora Limitless using pgbench
, a tool originally optimized for single-node PostgreSQL databases. We demonstrated that at each step of the process—from table creation to data loading and query execution—adjustments are necessary to effectively use the distributed architecture of Aurora Limitless.
We covered the following key points:
- The importance of sharding tables and choosing the appropriate shard keys in Aurora Limitless
- How to create and load sharded tables for
pgbench
in Aurora Limitless - The need to adapt data loading strategies, moving from
pgbench
’s multi-row INSERT statements to a more distributed approach using multiple connections - The concept of table collocation and its significance in a sharded environment
- How to create indexes and primary keys in a sharded setup
Although pgbench
is highly optimized for single-node systems, Aurora Limitless offers new opportunities for scalability and performance. However, to fully utilize these benefits, both schemas and workloads must be adapted to suit a sharded database architecture.
Troubleshooting
If your statement isn’t being pushed down and run on a single shard, you should determine if this statement only touches rows with a particular shard key value and potentially reference tables. Consider the following:
- Are all tables in the statement sharded?
- Is the table a reference table?
- Does the statement include an equality predicate on the shard key?
- Are there tables with different shard keys in the same statement?
- Are there non-distributed functions involved?
Conclusion
Aurora Limitless enables you create a sharded solution with high performance and scalability. As a sharded solution, consider and implement sharding as well as changes to workloads.
In this post, we used the common tool pgbench
to show how single-node solutions might not perform as well as they can in a multi-node solution. We considered the following:
- Sharded schema – We deliberately selected shard keys and created tables with sharding and collocation.
- Reference tables – For a table that is mostly read-only with little or infrequent DML, a reference table is the right choice. This is an opportunity to make sure statements are pushed down to a shard.
- Data load – An initial load is typically done with the data load utility. Single-node optimized loads are slower on multi-node systems, and parallel sessions are better than a few heavyweight sessions.
- Indexing – To be unique, an index must include the shard key. A table’s primary key will typically amend the shard key in the primary key definition (and index).
- Workload– The workload includes an equality predicate on the shard key to make sure most of the workload is pushed down to shards. In addition, function signatures include the shard key and are configured to be distributed.
A well-sharded system should include the schema and workload changes we discussed in this post. These workloads can scale to the configured maximum ACU and provide performance well over what single-node systems can offer. Aurora offers operational simplicity, enabling you to focus on the business value.
About the author
Stefan Karlsson is a seasoned professional with more than three decades of experience, including two decades in DBMS performance engineering. His focus is, and has been, on customers, partners, and performance. In his spare time, he enjoys cooking and beating his daughter in video games.