AWS Database Blog

Benchmarking Amazon Aurora Limitless with pgbench

Amazon Aurora Limitless is a database solution that grows and shrinks vertically and horizontally with the current workload requirements. It’s a managed solution based on the Amazon Aurora infrastructure, and isn’t constrained by a single node’s limitations.

In this post, we show you how to test performance with the common tool pgbench. This tool is built for single-node database management systems (DBMS) and is optimized for single-node use cases. As we shall see in this post, this means the tool doesn't necessarily measure what we think it measures on multi-node systems. We demonstrate how it works with Aurora Limitless, and we discuss the obstacles and opportunities you might encounter when using this tool with Aurora Limitless.

Performance and scalability

Your organization might measure performance based on the following traits:

  • Latency – There are statements, specifically reads and writes, that must complete within a period of time.
  • Transaction rate – There is a need for a minimum number of transactions per second, per minute, per hour, or per day, or a peak rate within a period of time.
  • Scalability – There is a need to grow without built-in limitations.

Typically, performance is measured based on all three of these traits. What use is a high transaction rate if users and applications don't get a response in a reasonable period of time? What use is good latency if it only holds when there are very few users? Scalability involves many moving parts, and the purpose of scalability tests is often to find the limits of the immovable ones.

Scalability points to a larger picture that includes architecture, while the individual metrics, such as latency, can each be a priority or a hard requirement.

Scalability should be tested well before applications outgrow the current solution. Tests provide two important properties related to scalability: how the system behaves under more stress than real-world applications generate, and at what workload the system reaches a limit.

The following are some things to consider when testing performance:

  • Workload – The more complex the workload, the harder it usually is to run and to understand the result. In contrast, the smaller the workload, the smaller the surface area tested, but with that simplicity comes ease of understanding.
  • Schema – The larger the schema, the more time required to set up, identify the workload, load, and reset. The smaller the schema, the more limited the choices for your workload, but the more precise the measurements and the stress points.
  • Scalability – This can be defined in various terms: number of users, number of connections, amount of data, transaction rate, and maximum transaction rate. It's important to focus on one or very few aspects of scalability early on.
  • Baseline – Lastly, you need a baseline to compare with. For example, consider a car that does 60 mph in 200 yards (or 100 kph in 200 meters): Is that good? Is that bad? In contrast, many of us can assess how a car that does 60 mph in 6 seconds compares.

The purpose of the exercise, the schema, the workload, the scalability target, and the baseline are all important questions to settle before starting a scalability benchmark.

When scaling up, it’s important to understand what parameters are important to the business, the applications, and the users. It’s also important to understand what affects performance in the toolset and the architecture. A tool often used with PostgreSQL is pgbench. In the following sections, we explore using pgbench with Aurora Limitless.

Aurora Limitless terminology

In this post, we use the following terminology:

  • Sharding – A strategy to divide data over multiple nodes.
  • Shard key – The column or columns that data is divided along.
  • Collocation – Declaring that tables’ rows should be collocated to the same shards. Collocation speeds up joins between collocated tables.
  • Reference table – A type of table that exists in full on all shards; it typically has low Data Manipulation Language (DML) rates and mostly static content.
  • Aurora Capacity Units (ACUs) – A metric for the resources dedicated to a shard group. For faster responsiveness, increase the minimum ACU, and for a higher capacity ceiling, increase the maximum ACU.
  • Shard group – The collection of nodes in an Aurora Limitless system. The number of nodes changes with workload, data, and ACU settings.
  • Router – The nodes that receive requests and either run them or push them down to shards. The latter performs and scales better.
  • Shard – The nodes that store and manage the data. They're only accessible from the shard group's routers.
  • Pushdown – The fastest and most scalable way to run statements in a sharded system. The term refers to execution being "pushed down" to the shards. Such statements have equality predicates on all tables' shard keys, the tables share the same shard key, and for joins the tables are collocated. Reference tables may also be included in these statements.
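The routing idea that underlies sharding and pushdown can be sketched in a few lines. The hash function and fixed shard count below are purely illustrative assumptions, not Aurora Limitless's actual implementation:

```python
# Illustrative sketch of hash-based shard routing.
# NUM_SHARDS and the hash function are assumptions for this example;
# Aurora Limitless manages shard placement automatically.
import hashlib

NUM_SHARDS = 4

def shard_for(shard_key_value):
    """Map a shard key value to a shard by hashing it."""
    digest = hashlib.sha256(str(shard_key_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Rows with the same shard key value always land on the same shard,
# which is what makes collocation and pushdown possible.
assert shard_for(42) == shard_for(42)
```

Because the mapping is deterministic, two tables sharded and collocated on the same key store matching rows on the same shard, so a join on that key never needs to cross shards.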

pgbench

pgbench is a popular tool for testing performance with PostgreSQL. The following is example code from a first run with Aurora Limitless:

$ ./pgbench --version
pgbench (15.2, server 16.4)
$ ./pgbench -i pgbench
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
DETAIL: Cannot drop limitless table pgbench_accounts because the table does not exist
NOTICE: table "pgbench_branches" does not exist, skipping
DETAIL: Cannot drop limitless table pgbench_branches because the table does not exist
NOTICE: table "pgbench_history" does not exist, skipping
DETAIL: Cannot drop limitless table pgbench_history because the table does not exist
NOTICE: table "pgbench_tellers" does not exist, skipping
DETAIL: Cannot drop limitless table pgbench_tellers because the table does not exist
creating tables...
generating data (client-side)...
100000 of 100000 tuples (100%) done (elapsed 0.06 s, remaining 0.00 s)
ERROR: cannot perform COPY FREEZE because the table was not created or truncated in the current subtransaction
pgbench: error: PQendcopy failed
$

pgbench generates data on the client and then writes to the DBMS:

copy pgbench_accounts from stdin with (freeze on)

This doesn't apply to all versions of pgbench; some versions optimize loading for single-node PostgreSQL by employing COPY ... FREEZE. That optimization fails due to a documented limitation with partitioned tables. For more information, see the official documentation on the COPY command.
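One workaround is to generate the data client-side and load it with a plain COPY (without FREEZE), for example through psql's \copy. The following sketch produces tab-separated pgbench_accounts rows; the helper name is hypothetical, and the accounts-per-branch constant mirrors pgbench's defaults:

```python
# Sketch: generate pgbench_accounts rows client-side for a plain COPY.
# account_rows is a hypothetical helper, not part of pgbench itself.
def account_rows(scale, accounts_per_branch=100_000):
    """Yield tab-separated rows: aid, bid, abalance, filler."""
    for aid in range(1, accounts_per_branch * scale + 1):
        bid = (aid - 1) // accounts_per_branch + 1
        yield f"{aid}\t{bid}\t0\t"

rows = list(account_rows(scale=1))
assert len(rows) == 100_000  # 100,000 accounts per branch at scale 1
```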

Even though the load fails, the tables are created. Let's see how they're defined:

pgbench=> select * from rds_aurora.limitless_tables;
 table_gid | local_oid | schema_name |    table_name    | table_status | table_type | distribution_key
-----------+-----------+-------------+------------------+--------------+------------+------------------
         7 |    143044 | public      | pgbench_accounts | active       | standard   |
         8 |    143047 | public      | pgbench_branches | active       | standard   |
         5 |    143038 | public      | pgbench_history  | active       | standard   |
         6 |    143041 | public      | pgbench_tellers  | active       | standard   |
(4 rows)

Let’s compare that with a pgbench database created for Aurora Limitless:

pgbench_limitless=> select * from rds_aurora.limitless_tables;
 table_gid | local_oid | schema_name |     table_name     | table_status | table_type | distribution_key
-----------+-----------+-------------+--------------------+--------------+------------+------------------
     16000 |     25206 | public      | pgbench_accounts   | active       | sharded    | HASH (bid)
     16001 |     25229 | public      | pgbench_branches   | active       | sharded    | HASH (bid)
     16002 |     25247 | public      | pgbench_history    | active       | sharded    | HASH (bid)
     16003 |     25265 | public      | pgbench_tellers    | active       | sharded    | HASH (bid)
(4 rows)

The important difference is the column table_type in the preceding code examples. In the first case, the type is standard and in the second case, the type is sharded. In the next section, we discuss what sharding is and how to create sharded tables.

Sharding

Sharding works by dividing up data into different chunks called shards. Each shard is then stored in a separate instance. This allows for greater scalability because reads and writes go to multiple instances. The strategy can be employed for different purposes and, because Aurora Limitless is an online transaction processing (OLTP) solution, the aim is to allow for a higher number of reads and writes. We declare a shard key and submit workloads, and Aurora Limitless divides the data and routes the statements.

Because Aurora Limitless must be able to tell how to route statements, we need to declare a shard key. The key, composed of one or more columns, determines in which instance rows with particular values are stored (or, in the case of INSERT, where to store them). Tables with the same shard key should be declared as collocated, so rows in different tables with the same shard key value are stored in the same shard.

Sharding and pgbench

The pgbench default schema is inspired by the TPC-B benchmark, simulating a banking system in four tables: branches, tellers, accounts, and history, with many tellers and many accounts per branch. Each table has a corresponding primary key: branch_id, teller_id, and account_id. The relationships are modeled by having branch_id as a foreign key in tellers and in accounts. Each transaction is recorded in the history table.

Typically, a combination of the schema and transactions provides the shard key. In this case, the clear choice is branch_id because it’s in all four tables.

To shard tables during creation, we set the table mode to sharded and the shard key to a single column, branch_id (bid).

The branch ID is the primary key for the table pgbench_branches, and we want rows from all sharded tables to be collocated with the rows in pgbench_branches, all having the same value for shard key bid.

The collocation option is set after the table pgbench_branches is created, so Aurora Limitless knows that the subsequently created tables are collocated with it, which makes joins more efficient. The following snippet shows the runnable sequence of commands to create sharded and collocated tables:

$ cat limitless_create_tables.sql
SET rds_aurora.limitless_create_table_mode='sharded';
SET rds_aurora.limitless_create_table_shard_key='{"bid"}';
CREATE TABLE pgbench_branches (
    bid integer NOT NULL,
    bbalance integer,
    filler character(88)
);
SET rds_aurora.limitless_create_table_collocate_with='pgbench_branches';
CREATE TABLE pgbench_tellers (
    tid integer NOT NULL,
    bid integer,
    tbalance integer,
    filler character(84)
);
CREATE TABLE pgbench_accounts (
    aid bigint NOT NULL,
    bid integer,
    abalance integer,
    filler character(84)
);
CREATE TABLE pgbench_history (
    tid integer,
    bid integer,
    aid bigint,
    delta integer,
    mtime timestamp without time zone,
    filler character(22)
);

After sharding, a SELECT from rds_aurora.limitless_tables will show them as sharded:

pgbench_limitless=> select * from rds_aurora.limitless_tables;
 table_gid | local_oid | schema_name |     table_name     | table_status | table_type | distribution_key
-----------+-----------+-------------+--------------------+--------------+------------+------------------
     16003 |     22430 | public      | pgbench_accounts   | active       | sharded    | HASH (bid)
     16002 |     22409 | public      | pgbench_branches   | active       | sharded    | HASH (bid)
     16004 |     22451 | public      | pgbench_history    | active       | sharded    | HASH (bid)
     16005 |     22473 | public      | pgbench_tellers    | active       | sharded    | HASH (bid)
(4 rows)

The system view rds_aurora.limitless_table_collocations shows collocated tables:

pgbench_limitless=> SELECT * FROM rds_aurora.limitless_table_collocations ORDER BY collocation_id;
 collocation_id | schema_name |    table_name
----------------+-------------+------------------
          16000 | public      | pgbench_branches
          16000 | public      | pgbench_accounts
          16000 | public      | pgbench_history
          16000 | public      | pgbench_tellers
(4 rows)

Creating indexes

Indexes can be created before or after the tables are loaded. Indexes are local, so for sharded tables, unique indexes must include the shard key. This goes for both unique and primary key constraints. This affects what workloads—statements and transactions—are run efficiently.

For the actual commands, this means a minor change to add the constraints. Note that pgbench_history is not, and usually should not be, indexed:

$ cat limitless_creidx.psql
alter table pgbench_branches add primary key (bid);
alter table pgbench_tellers add primary key (tid, bid);
alter table pgbench_accounts add primary key (aid, bid);

A common best practice is that the more selective column is the first column in the index. Because there are 10 tellers per branch, and 100,000 accounts per branch, bid is less selective than tid and aid.
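The arithmetic behind that choice is simple enough to verify. Using the pgbench proportions (10 tellers and 100,000 accounts per branch), a minimal sketch:

```python
# Distinct values per column at a given pgbench scale. tid and aid are
# more selective than bid, so they come first in the composite keys.
scale = 10_000
branches = scale             # distinct bid values
tellers = 10 * scale         # distinct tid values
accounts = 100_000 * scale   # distinct aid values

# Each bid value matches 10 teller rows and 100,000 account rows,
# whereas tid and aid each match exactly one row in their own table.
assert tellers // branches == 10
assert accounts // branches == 100_000
```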

Loading data

Before we resolve loading pgbench data, let’s look at an example of the statements issued by pgbench. First, let’s explore what the Aurora Limitless implementation of pg_stat_statements says about the data load:

pgbench_limitless=> CREATE EXTENSION pg_stat_statements;
pgbench_limitless=> select subcluster_id, subcluster_type, sso_calls, query, calls from rds_aurora.limitless_stat_statements order by calls desc limit 3;
subcluster_id | subcluster_type | sso_calls | query                                                                                   [...]                           | calls
--------------+-----------------+-----------+-------------------------------------------------------------------------------------------------------------------------+----------
13            | shard           |           | INSERT INTO public.pgbench_accounts(aid, bid, abalance, filler) VALUES ($1, $2, $3, $4),[...], ($397, $398, $399, $400) | 12622000
6             | shard           |           | INSERT INTO public.pgbench_accounts(aid, bid, abalance, filler) VALUES ($1, $2, $3, $4),[...], ($397, $398, $399, $400) | 12597000
10            | shard           |           | INSERT INTO public.pgbench_accounts(aid, bid, abalance, filler) VALUES ($1, $2, $3, $4),[...], ($397, $398, $399, $400) | 12554000
. . .

As we can see, the INSERT statement has 100 row constructors, each with four placeholders. (For readability, the ellipses replace most of the row constructors; we return to the column sso_calls later.) This is a common design pattern to improve performance on a single-node DBMS such as PostgreSQL. For Aurora Limitless, a different design pattern improves performance, particularly scalability.

Design pattern for application insert and loading

There are different scenarios where inserts or data loads are optimized. This section discusses how applications insert data, not the initial load or bulk operations.

For single nodes, the common design pattern is to send a large number of rows in one step, as shown in the preceding statements. This design pattern and its gains don't manifest in Aurora Limitless because each row must be evaluated. Aurora Limitless needs to unpack the rows to determine to which instance each one should be routed. Then it transmits each statement to the right instance while maintaining transactional consistency. This is why the pgbench load is slow compared to a single-node DBMS.

For Aurora Limitless, it’s better to use many connections. Each connection submits one row to insert at a time. Then each row gets pushed to the instance where the row should reside. This can be 10, 100, or 1,000 connections—Aurora Limitless scales according to need.
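The pattern can be sketched with a thread pool where each worker stands in for one connection issuing a single-row INSERT. The routing function, shard count, and per-branch ratio here are illustrative assumptions; in a real load each worker would hold its own database connection:

```python
# Sketch of the many-connections, one-row-at-a-time insert pattern.
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # illustrative assumption

def insert_row(row):
    """Stand-in for one connection running a single-row INSERT.
    With bid known, the statement can be routed to exactly one shard."""
    aid, bid = row
    return bid % NUM_SHARDS

# 1,000 rows with 100 accounts per branch, for this sketch only.
rows = [(aid, (aid - 1) // 100 + 1) for aid in range(1, 1001)]
with ThreadPoolExecutor(max_workers=100) as pool:
    shards = list(pool.map(insert_row, rows))
assert len(shards) == 1000
```

Because every statement carries its shard key, each one is pushed down to a single shard, and the shards load data in parallel instead of the routers unpacking multi-row INSERT statements.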

Loading a pgbench database

There are several ways of loading pgbench data, some faster, some more time-consuming. A simple way of loading pgbench data in Aurora Limitless uses SQL and SQL functions. The scale parameter sets the proportions of the tables, so scale 10000 means that pgbench_branches will have 10,000 rows, pgbench_tellers will have 100,000 rows, and pgbench_accounts will have 1 billion rows. The following script loads the three main tables (pgbench_history starts empty):

$ cat limitless_pgbench_load.psql
\set scale 10000
insert into pgbench_branches(bid,bbalance)
  select bid, 0
  from generate_series(1, :scale) as bid;
insert into pgbench_tellers(tid,bid,tbalance)
  select tid, (tid - 1) / 10 + 1, 0
  from generate_series(1, 10 * :scale) as tid;
insert into pgbench_accounts(aid,bid,abalance,filler)
  select aid, (aid - 1) / 100000 + 1, 0, ''
  from generate_series(1, 100000 * :scale) as aid;
-- vacuum analyze
vacuum analyze pgbench_branches;
vacuum analyze pgbench_tellers;
vacuum analyze pgbench_accounts;
vacuum analyze pgbench_history;
$

This results in the following:

pgbench_limitless=> SELECT COUNT(*) FROM pgbench_branches;
 count
-------
 10000
(1 row)

pgbench_limitless=> SELECT COUNT(*) FROM pgbench_tellers;
 count
--------
 100000
(1 row)

pgbench_limitless=> SELECT COUNT(*) FROM pgbench_accounts;
   count
------------
 1000000000
(1 row)

pgbench_limitless=> select subcluster_id , subcluster_type , db_size ,pg_size_pretty(db_size) AS Database_Size
pgbench_limitless-> from rds_aurora.limitless_stat_database_size('pgbench_limitless');
 subcluster_id | subcluster_type |   db_size   | database_size
---------------+-----------------+-------------+---------------
 2             | router          |    12926947 | 12 MB
 5             | shard           | 42500177891 | 40 GB
 3             | router          |    12780047 | 12 MB
 7             | shard           | 43964325859 | 41 GB
 4             | shard           | 42221371363 | 39 GB
 6             | shard           | 41851830243 | 39 GB
(6 rows)

pgbench_limitless=> SELECT subcluster_id
pgbench_limitless->         , subcluster_type
pgbench_limitless->         , pg_size_pretty(total_size)  as total_size
pgbench_limitless->         , pg_size_pretty(table_size) as  table_size
pgbench_limitless->         , pg_size_pretty(indexes_size) as indexes_size
pgbench_limitless->         , pg_size_pretty(toast_size) as toast_size
pgbench_limitless-> FROM rds_aurora.limitless_stat_relation_sizes('public','pgbench_accounts')
pgbench_limitless-> ORDER BY total_size desc;
 subcluster_id | subcluster_type | total_size | table_size | indexes_size | toast_size
---------------+-----------------+------------+------------+--------------+------------
 7             | shard           | 41 GB      | 32 GB      | 9297 MB      | 0 bytes
 5             | shard           | 39 GB      | 32 GB      | 8080 MB      | 0 bytes
 4             | shard           | 39 GB      | 31 GB      | 8088 MB      | 0 bytes
 6             | shard           | 39 GB      | 31 GB      | 7963 MB      | 0 bytes
 3             | router          | 0 bytes    | 0 bytes    | 0 bytes      | 0 bytes
 2             | router          | 0 bytes    | 0 bytes    | 0 bytes      | 0 bytes
(6 rows)
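The integer arithmetic that derives bid in the load script is worth checking in isolation. The following sketch mirrors the expressions in the INSERT ... SELECT statements:

```python
# Mirror of the bid derivations in limitless_pgbench_load.psql:
# tellers:  (tid - 1) / 10 + 1
# accounts: (aid - 1) / 100000 + 1
def teller_bid(tid):
    return (tid - 1) // 10 + 1

def account_bid(aid):
    return (aid - 1) // 100_000 + 1

# 10 tellers and 100,000 accounts map to each branch:
assert teller_bid(1) == 1 and teller_bid(10) == 1 and teller_bid(11) == 2
assert account_bid(100_000) == 1 and account_bid(100_001) == 2
```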

Tables are created, sharded, indexed, and loaded. Then it’s time for vacuum:

$ cat vacuum.psql
vacuum analyze pgbench_branches;
vacuum analyze pgbench_tellers;
vacuum analyze pgbench_accounts;
vacuum analyze pgbench_history;

A comment on scale

Scale 10,000 means that pgbench_branches has 10,000 rows. This places a limit on the transaction rate, because concurrent transactions updating the same branch row contend with each other. So, choose the scaling factor advisedly, taking the target throughput into account, and monitor wait events.
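A rough way to see the contention is the birthday-problem calculation, assuming clients pick branches uniformly at random (an estimate for intuition, not an Aurora metric):

```python
# Probability that at least two of `clients` concurrent transactions
# update the same pgbench_branches row, given `branches` rows to choose
# from (birthday-problem estimate; assumes uniform random bid values).
def collision_probability(clients, branches):
    p_no_collision = 1.0
    for i in range(clients):
        p_no_collision *= (branches - i) / branches
    return 1.0 - p_no_collision

# With 100 clients at scale 10,000, contention is already noticeable:
assert 0.3 < collision_probability(100, 10_000) < 0.5
# With 800 clients, some row-level contention is a near certainty:
assert collision_probability(800, 10_000) > 0.99
```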

Running pgbench

We have created the tables, sharded, and collocated them. The tables are indexed and the indexes include the shard key. The data is loaded and the tables are vacuumed. Now we’re ready to try pgbench on Aurora Limitless. The following code illustrates a 10-minute run with 100 users:

$ pgbench -f pgbench_tpcb.sql -c 100 -j 100 -n -P 60 -T 600
pgbench (15.2, server 16.4)
progress: 60.0 s, 376.9 tps, lat 261.750 ms stddev 240.250, 0 failed
progress: 120.0 s, 414.7 tps, lat 241.119 ms stddev 224.369, 0 failed
progress: 180.0 s, 448.9 tps, lat 222.852 ms stddev 222.388, 0 failed
progress: 240.0 s, 477.2 tps, lat 209.551 ms stddev 214.490, 0 failed
progress: 300.0 s, 511.5 tps, lat 195.502 ms stddev 192.859, 0 failed
progress: 360.0 s, 519.9 tps, lat 189.697 ms stddev 204.396, 0 failed
progress: 420.0 s, 574.4 tps, lat 174.318 ms stddev 198.457, 0 failed
progress: 480.0 s, 595.5 tps, lat 170.055 ms stddev 199.635, 0 failed
progress: 540.0 s, 596.7 tps, lat 167.573 ms stddev 201.789, 0 failed
progress: 600.0 s, 664.2 tps, lat 150.011 ms stddev 167.801, 0 failed
transaction type: ./pgbench_tpcb.sql
scaling factor: 10000
query mode: simple
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 310880
number of failed transactions: 0 (0.000%)
latency average = 192.968 ms
latency stddev = 207.617 ms
initial connection time = 856.458 ms
tps = 517.896598 (without initial connection time)

We can now determine if this is what we expected from pgbench and Aurora Limitless.

Sharding and workload changes

The result from our first run isn’t what we expected. Let’s see what happened.

The transaction (singular, because there is only one type of transaction) is as follows:

$ cat pgbench_tpcb.sql
\set scale 10000
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

In the following code, we can see how this transaction is reflected in rds_aurora.limitless_stat_statements:

postgres_limitless=> CREATE EXTENSION pg_stat_statements;
postgres_limitless=> select subcluster_id, subcluster_type, sso_calls, query, calls from rds_aurora.limitless_stat_statements order by calls desc limit 4;
 subcluster_id | subcluster_type | sso_calls |                                       query                                        | calls
---------------+-----------------+-----------+------------------------------------------------------------------------------------+-------
 3             | router          |         0 | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2                 | 67651
 3             | router          |         0 | SELECT abalance FROM pgbench_accounts WHERE aid = $1                               | 67651
 3             | router          |         0 | UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2                | 67651
 3             | router          |     67651 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2                | 67651
(4 rows)

Checking the column calls, we can see that the statements are run the same number of times. This is as expected.

But what about the column sso_calls, and what does that mean?

Statements can be run on the router or pushed down to a shard for execution. The former is the default strategy and can be thought of as bringing data to the router and then running the operator. The latter performs much better by moving the processing to the data, and there are numerous optimizations around single-shard optimized execution, which is what sso_calls counts.

For one statement, the UPDATE on pgbench_branches, we can see that all executions are single-shard optimized, because the value in calls equals the value in sso_calls. The other two UPDATE statements and the SELECT are run by sending the statements to all shards.

We can achieve single-shard optimized execution by including the shard key in the WHERE clause. This happens for pgbench_branches because bid is both the shard key and the table's primary key.

As we can see, the preceding UPDATE statements on pgbench_tellers and pgbench_accounts don't include predicates on the shard key. Therefore, they aren't executed as efficiently as they could be.

Aurora Limitless can’t push statements down to the data unless it knows the shard key. Therefore, we need to check the workload and include an equality predicate on the shard key.
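The effect on routing can be sketched as a simple rule: with an equality predicate on the shard key, the router sends one statement to one shard; otherwise it fans the statement out to every shard. The shard count and the predicate model are illustrative assumptions:

```python
# Sketch of router fan-out versus single-shard pushdown.
NUM_SHARDS = 4  # illustrative assumption

def statements_sent(predicate_columns, shard_key="bid"):
    """Statements issued to shards for one incoming statement."""
    return 1 if shard_key in predicate_columns else NUM_SHARDS

# WHERE aid = :aid                -> broadcast to every shard
assert statements_sent({"aid"}) == 4
# WHERE aid = :aid AND bid = :bid -> single-shard optimized
assert statements_sent({"aid", "bid"}) == 1
```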

For pgbench, we need to make changes to the actual workload—in this case, add a predicate on bid to the UPDATE statements on pgbench_accounts and pgbench_tellers. The following is the original statement on table pgbench_accounts:

UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;

In pgbench scripts, this is straightforward to change, especially because bid in the variable :bid is already generated by the driver:

UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid AND bid = :bid;

The complete script for running a pgbench standard workload on Aurora Limitless adds an extra predicate on the shard key:

$ cat limitless_pgbench_tpcb.sql
\set scale 10000
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid AND bid = :bid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid AND bid = :bid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid AND bid = :bid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

Before we see the difference between the original and the modified versions, let’s consider the impact on real-world applications.

Design patterns for sharded databases

For real-world applications, this is sometimes more complex; the following are some examples based on the common customer, order, and order_item table structure.

In this case, customer_id is the primary key in customer and a foreign key in order, so declaring customer_id as the shard key in both tables is the clear choice. This is coupled with the fact that the fictive workload includes a predicate on the customer ID. To include joins with order_item, we need the shard key there too. The common approach is to denormalize order_item by adding customer_id to that table. It's also necessary to change the application's read and write operations on order_item accordingly.

Let's include another factor in our example: a product structure. We use the tables product, product_supplier, product_category, product_description, and product_attributes. These are different entities describing a different area than the customer-oriented tables, but at the same time they're used by the application together with orders. Here we introduce another approach: reference tables.

Sharded tables are divided into chunks; reference tables are not. Reference tables exist on every shard, fully populated with all rows. This means that DML statements must go to all instances, making reference tables less suited for high-performance DML. Joins between a sharded table and a reference table can be pushed down to a single shard for highest performance.

In our example, the product structure isn't updated frequently. We make the joins between order_item and product more efficient by populating a reference table daily with the required data from the product tables.

Running pgbench on the modified workload

Now that we better understand workloads and how statements can be executed efficiently through pushdown, we’re in a better place for another run. We have made these changes:

  • The tables are sharded with the branch ID as the shard key
  • Loading is complete
  • Indexes are created and include the tables’ shard keys
  • The workload has been changed to include a predicate on the shard key

The following is output from a pgbench run with the modified workload:

pgbench (15.2, server 16.4)
progress: 60.0 s, 1338.0 tps, lat 552.224 ms stddev 253.185, 0 failed
progress: 120.0 s, 1440.8 tps, lat 556.205 ms stddev 1316.137, 0 failed
progress: 180.0 s, 1519.6 tps, lat 526.476 ms stddev 233.708, 0 failed
progress: 240.0 s, 1459.4 tps, lat 548.314 ms stddev 132.402, 0 failed
progress: 300.0 s, 1504.2 tps, lat 530.594 ms stddev 140.465, 0 failed
progress: 360.0 s, 1481.8 tps, lat 541.122 ms stddev 344.579, 0 failed
progress: 420.0 s, 1652.2 tps, lat 483.661 ms stddev 283.605, 0 failed
progress: 480.0 s, 1649.8 tps, lat 484.780 ms stddev 276.784, 0 failed
progress: 540.0 s, 1667.3 tps, lat 480.449 ms stddev 286.714, 0 failed
transaction type: ./limitless_pgbench_tpcb.sql
scaling factor: 10000
query mode: simple
number of clients: 800
number of threads: 800
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 916644
number of failed transactions: 0 (0.000%)
latency average = 519.629 ms
latency stddev = 468.349 ms
initial connection time = 4971.928 ms
tps = 1538.705516 (without initial connection time)

We see an improvement from 518 TPS to 1,538 TPS (this run used 800 clients rather than 100). However, as the latency shows, something isn't right. Let's take the next step to improve performance.

Testing guidelines for Aurora Limitless

The process of testing performance requires several best practices to provide reliable results. One best practice is to run the workload for a sufficient amount of time. The main reason for that is to make sure the database is warm and that pages are read from storage into memory to minimize unnecessary reads during the steady state. Another best practice is to reset the database to its initial state and to vacuum. The following code is a sample vacuum that also resets the tables:

$ cat vacuum.sql
UPDATE pgbench_branches SET bbalance = 0 WHERE bbalance <> 0;
UPDATE pgbench_tellers SET tbalance = 0 WHERE tbalance <> 0;
UPDATE pgbench_accounts SET abalance = 0 WHERE abalance <> 0;
TRUNCATE TABLE pgbench_history;
vacuum analyze pgbench_branches;
vacuum analyze pgbench_tellers;
vacuum analyze pgbench_accounts;
vacuum analyze pgbench_history;

The best practice is to warm up and run workloads for a long enough time. This makes sure that as much data as possible is in the cache. In the case of Aurora Limitless, there is another reason to follow this best practice.

Aurora Limitless grows and shrinks horizontally and vertically. A key component of the horizontal scaling is Amazon Aurora Serverless. A main characteristic of Serverless is that it scales up with higher workloads and scales down with lower workloads. Therefore, we need to make sure the buffer cache is warm, and allow Aurora Serverless to scale instances up.

Let’s see what happens if we increase the runtime from 10 minutes to 1 hour. The following is the pgbench command and output:

$ pgbench -f limitless_pgbench_tpcb.sql -c 100 -j 100 -s 10000 -n -P 60 -T 3600
pgbench (15.2, server 16.4)
progress: 60.0 s, 319.8 tps, lat 304.093 ms stddev 419.914, 0 failed
progress: 120.0 s, 540.6 tps, lat 185.089 ms stddev 134.676, 0 failed
progress: 180.0 s, 696.0 tps, lat 143.701 ms stddev 47.232, 0 failed
progress: 240.0 s, 862.2 tps, lat 116.025 ms stddev 28.890, 0 failed
progress: 300.0 s, 964.3 tps, lat 103.700 ms stddev 45.325, 0 failed
progress: 360.0 s, 935.7 tps, lat 106.872 ms stddev 131.577, 0 failed
progress: 420.0 s, 1002.0 tps, lat 99.809 ms stddev 125.449, 0 failed
. . .
progress: 3240.0 s, 2480.5 tps, lat 40.313 ms stddev 3.903, 0 failed
progress: 3300.0 s, 2450.0 tps, lat 40.815 ms stddev 24.320, 0 failed
progress: 3360.0 s, 2451.1 tps, lat 40.797 ms stddev 24.491, 0 failed
progress: 3420.0 s, 2487.9 tps, lat 40.193 ms stddev 3.796, 0 failed
progress: 3480.0 s, 2489.5 tps, lat 40.168 ms stddev 3.702, 0 failed
progress: 3540.0 s, 2457.2 tps, lat 40.696 ms stddev 24.058, 0 failed
progress: 3600.0 s, 2484.5 tps, lat 40.246 ms stddev 4.268, 0 failed
transaction type: ./limitless_pgbench_tpcb.sql
scaling factor: 10000
query mode: simple
number of clients: 100
number of threads: 100
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 7347124
number of failed transactions: 0 (0.000%)
latency average = 48.977 ms
latency stddev = 61.903 ms
initial connection time = 1544.353 ms
tps = 2041.690066 (without initial connection time)

In this example, pgbench is run for 3,600 seconds (1 hour). In the output, we can see that thanks to prior runs and vacuum, it starts off at 320 TPS and gradually increases. It ends up over 2,400 TPS with a latency of about 40 milliseconds. The average over 1 hour is 2,042 TPS at 49 milliseconds.
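As a quick sanity check, the summary numbers are consistent: dividing the processed transactions by the duration reproduces the reported average. The reported tps figure excludes initial connection time; the adjustment below is our simplification of that accounting, not pgbench's exact formula.

```python
# Sanity check on the 1-hour run's summary (numbers taken from the output above).
transactions = 7_347_124
duration_s = 3600.0
initial_connection_s = 1.544353  # reported as 1544.353 ms

naive_tps = transactions / duration_s
# Approximation of pgbench's "without initial connection time" figure.
adjusted_tps = transactions / (duration_s - initial_connection_s)

print(round(naive_tps, 1))     # 2040.9
print(round(adjusted_tps, 1))  # close to the reported 2041.69
```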

The growth in TPS reflects how Aurora Limitless responds to the workload by increasing the ACUs.

The following graph from Amazon CloudWatch shows how the shard group’s DBShardGroupCapacity grows and shrinks.

We see the upward slope, where Aurora Serverless is adding resources such as CPU and memory, followed by a steady state where the shard group has scaled fully to the maximum ACU and the curve plateaus.

An ACU (Aurora capacity unit) is a measure of capacity that combines numerous resources, including CPU and memory. It’s tightly integrated into Aurora, including Amazon Aurora PostgreSQL-Compatible Edition and its memory management.

We set the maximum ACU when we create the shard group and can modify it as needed. It’s the maximum amount of resources available to the shard group. When the system maxes out, we can manually increase the maximum ACU to provide additional resources to the shard group.

A higher minimum ACU improves how quickly more resources are provided to the shard group, but at the same time incurs higher costs.

Topology

With the current ACU setting, throughput at the maximum ACU is close to 2,500 TPS, as shown in the following snippet from the previous run:

progress: 3240.0 s, 2480.5 tps, lat 40.313 ms stddev 3.903, 0 failed
progress: 3300.0 s, 2450.0 tps, lat 40.815 ms stddev 24.320, 0 failed
progress: 3360.0 s, 2451.1 tps, lat 40.797 ms stddev 24.491, 0 failed
progress: 3420.0 s, 2487.9 tps, lat 40.193 ms stddev 3.796, 0 failed
progress: 3480.0 s, 2489.5 tps, lat 40.168 ms stddev 3.702, 0 failed
progress: 3540.0 s, 2457.2 tps, lat 40.696 ms stddev 24.058, 0 failed
progress: 3600.0 s, 2484.5 tps, lat 40.246 ms stddev 4.268, 0 failed

Aurora Limitless is based on two types of instances: routers and shards. Applications connect to routers through a DNS service. The purpose of routers is to distribute workloads by routing statements to shards and to execute statements that span multiple shards. Shards store and access data, executing the statements that the routers pass on. The number of routers and shards depends on the shard group’s ACU setting. For our shard group, there are two routers:

postgres_limitless=> CREATE EXTENSION pg_stat_statements;
postgres_limitless=> SELECT subcluster_id, subcluster_type FROM rds_aurora.limitless_stat_statements_info;
 subcluster_id | subcluster_type
---------------+-----------------
 3             | router
 2             | router
 5             | shard
 7             | shard
 4             | shard
 6             | shard
(6 rows)

The following graph shows the CPUUtilization from CloudWatch for the two routers during the run with the modified workload.

The yellow line for one of the two routers stands out. The blue line for the other router sits on the X-axis throughout the test, using almost no CPU. In other words, when pgbench connected to the endpoint, all connections were routed to the same router.

What we observe in the graph is that one router is busy while the other is more or less idle. This phenomenon affects benchmarking but is less of an issue in the real world. A benchmarking tool such as pgbench is built to establish connections as quickly as possible, so all connections resolve to the same endpoint and land on the same router.

For the most part, this isn’t an issue in real-world applications. The exception would be an application server that requests many connections (hundreds or thousands) as fast as pgbench does and at almost the same time, for instance during startup. A slightly lower connection rate would offset this, as would a Network Load Balancer.
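To illustrate why the connection rate matters, here is a toy model (not Aurora's actual DNS resolution logic; the TTL value and router names are hypothetical) in which a cached DNS answer is reused until its TTL expires. Connections opened much faster than the TTL all land on one router, while slower connects spread out:

```python
import random

TTL = 1.0  # hypothetical DNS cache lifetime in seconds
ROUTERS = ["router-a", "router-b"]  # hypothetical router names

def route_connections(n, interval, seed=0):
    """Route n connections opened `interval` seconds apart via a cached DNS answer."""
    rng = random.Random(seed)
    counts = {r: 0 for r in ROUTERS}
    cached, cached_at = None, float("-inf")
    t = 0.0
    for _ in range(n):
        if t - cached_at >= TTL:  # cache expired: re-resolve the endpoint
            cached, cached_at = rng.choice(ROUTERS), t
        counts[cached] += 1
        t += interval
    return counts

burst = route_connections(100, interval=0.001)  # pgbench-style connection burst
steady = route_connections(100, interval=2.0)   # slower, spread-out connects
print(burst)   # all 100 connections end up on a single router
print(steady)  # a roughly even split between the two routers
```

The model makes the mechanism concrete: it is the burst rate relative to the DNS answer's lifetime, not the total number of connections, that concentrates load on one router.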

To take full advantage of the shard group when benchmarking with tools such as pgbench, run one pgbench per router, each pgbench connecting to a different endpoint. The following query shows the router endpoints:

pgbench_limitless=> select * from aurora_limitless_router_endpoints();
 subcluster_id | router_endpoint                        | load
---------------+----------------------------------------+------
 2             | cluster-router1-endpoint.amazonaws.com | 0.47
 3             | cluster-router2-endpoint.amazonaws.com | 0.22
(2 rows)

Start one pgbench session for each router, in this case totaling 1,200 users:

$ pgbench -f limitless_pgbench_tpcb.sql -c 600 -j 600 -s 10000 -n -P 60 -T 3600 -h cluster-router1-endpoint.amazonaws.com >& single-shard_60mins_router1.out &
pgbench (15.2, server 16.4)
progress: 60.0 s, 1568.9 tps, lat 352.796 ms stddev 241.762, 0 failed
progress: 120.0 s, 1287.1 tps, lat 460.013 ms stddev 437.582, 0 failed
progress: 180.0 s, 1122.3 tps, lat 541.893 ms stddev 511.414, 0 failed
progress: 240.0 s, 1229.3 tps, lat 489.162 ms stddev 325.743, 0 failed
progress: 300.0 s, 1306.4 tps, lat 455.785 ms stddev 333.931, 0 failed
. . .
progress: 3360.0 s, 3499.5 tps, lat 171.487 ms stddev 87.637, 0 failed
progress: 3420.0 s, 3485.1 tps, lat 172.157 ms stddev 82.596, 0 failed
progress: 3480.0 s, 3487.3 tps, lat 172.029 ms stddev 85.558, 0 failed
progress: 3540.0 s, 3473.3 tps, lat 172.686 ms stddev 82.973, 0 failed
progress: 3600.0 s, 3486.0 tps, lat 172.183 ms stddev 74.917, 0 failed
transaction type: ./limitless_pgbench_tpcb.sql
scaling factor: 10000
query mode: simple
number of clients: 600
number of threads: 600
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 10189618
number of failed transactions: 0 (0.000%)
latency average = 211.742 ms
latency stddev = 185.611 ms
initial connection time = 4176.875 ms
tps = 2833.443621 (without initial connection time)
$ pgbench -f limitless_pgbench_tpcb.sql -c 600 -j 600 -s 10000 -n -P 60 -T 3600 -h cluster-router2-endpoint.amazonaws.com >& single-shard_60mins_router2.out &
pgbench (15.2, server 16.4)
progress: 60.0 s, 645.5 tps, lat 847.349 ms stddev 827.334, 0 failed
progress: 120.0 s, 803.9 tps, lat 746.848 ms stddev 599.398, 0 failed
progress: 180.0 s, 1070.9 tps, lat 560.943 ms stddev 320.656, 0 failed
progress: 240.0 s, 1103.6 tps, lat 539.849 ms stddev 376.822, 0 failed
progress: 300.0 s, 1241.4 tps, lat 485.111 ms stddev 334.098, 0 failed
. . .
progress: 3360.0 s, 3512.0 tps, lat 170.844 ms stddev 84.294, 0 failed
progress: 3420.0 s, 3512.0 tps, lat 170.903 ms stddev 86.151, 0 failed
progress: 3480.0 s, 3508.2 tps, lat 170.961 ms stddev 78.208, 0 failed
progress: 3540.0 s, 3757.2 tps, lat 159.755 ms stddev 72.476, 0 failed
progress: 3600.0 s, 4243.2 tps, lat 141.380 ms stddev 148.445, 0 failed
transaction type: ./limitless_pgbench_tpcb.sql
scaling factor: 10000
query mode: simple
number of clients: 600
number of threads: 600
maximum number of tries: 1
duration: 3600 s
number of transactions actually processed: 10366355
number of failed transactions: 0 (0.000%)
latency average = 208.086 ms
latency stddev = 192.865 ms
initial connection time = 4969.839 ms
tps = 2883.246947 (without initial connection time)

With the two pgbench sessions running concurrently, this is a workload of 1,200 users totaling well over 7,000 TPS, based on the later progress lines in the output above. The averages over the entire 1-hour run are lower due to ramp-up.
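The combined throughput can be read off by adding the two sessions' progress lines at the same timestamps, for example (figures taken from the output above):

```python
# Pairs of progress lines at the same timestamp from the two concurrent sessions.
tps_at_3360 = 3499.5 + 3512.0  # steady state
tps_at_3600 = 3486.0 + 4243.2  # final progress lines

print(round(tps_at_3360, 1))  # 7011.5
print(round(tps_at_3600, 1))  # 7729.2
```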

Limitless pgbench

The focus of this post has largely been on pgbench functionality with Aurora Limitless. The discussion has been pgbench specific but has also described the impact on real-world applications. Let’s shift focus to high performance, and for that purpose use a new shard group.

This test requires no particular configuration or settings in parameter groups, but it does require:

  • A significantly higher maximum ACU
  • Large driver instances
  • Balancing driver connections over all routers

At the time of general availability (GA), the maximum ACU is limited to 6,144 ACUs, and increasing that limit requires a discussion with AWS. In this case, the maximum ACU was set well above the GA limit, and the minimum ACU was increased to shorten the warm-up time.

To drive the workload, three c8g.48xlarge instances were set up.

There are several options for balancing connections over all routers. An earlier example in this post describes how to use one pgbench session per router. A far better alternative is to use an NLB, which can be leveraged not only to balance connections but also to improve availability. To demonstrate yet another alternative, let’s use functionality built into libpq and pgbench as of version 16: load balancing across hosts.

To use the built-in load balancing, assign the list of router endpoints (as a comma-separated list) to PGHOST and set PGLOADBALANCEHOSTS to random:

PGHOST="router1-endpoint,router2-endpoint,..."
PGLOADBALANCEHOSTS=random
export PGHOST PGLOADBALANCEHOSTS

This approach doesn’t guarantee a perfect distribution of connections, only a randomized one. To double-check, use CloudWatch and the router metric DatabaseConnections. The following graph is from the test in this section of the post.

The graph shows 163 to 217 connections per router. With a total of 6,000 connections over 32 routers, a perfect distribution would be 187.5 connections per router. The distribution isn’t perfect, but this example shows a simple way of spreading connections.
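The observed spread is in line with what independent random choices produce. A quick simulation (fixed seed, purely illustrative) of 6,000 connections over 32 routers:

```python
import random

# Simulate PGLOADBALANCEHOSTS=random: each connection independently picks a router.
rng = random.Random(42)  # fixed seed, purely illustrative
ROUTERS, CONNECTIONS = 32, 6000

counts = [0] * ROUTERS
for _ in range(CONNECTIONS):
    counts[rng.randrange(ROUTERS)] += 1

# Ideal would be 187.5 per router; random assignment typically lands each
# router within a few dozen of that, much like the observed 163 to 217.
print(min(counts), max(counts))
```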

Lastly, for the sake of simplicity, let’s use one of the tables from earlier with a simpler workload: updates of random rows. The table now holds 100 billion rows, that is, scale 1,000,000. Here is the workload:

$ cat limitless_pgbench_update.sql
\set scale 1000000
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set delta random(-5000, 5000)
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid AND bid = :bid;
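As a sanity check on the key space (100,000 is pgbench's standard row count per scale unit in pgbench_accounts, and pgbench_branches has one row per scale unit), the script's :aid range covers 100 billion accounts, and the equality predicate on bid, the shard key, is what lets each statement be pushed down to a single shard:

```python
# Key-space arithmetic for the update workload at scale 1,000,000.
scale = 1_000_000
accounts_per_scale_unit = 100_000  # standard pgbench_accounts rows per scale unit
branches = scale                   # pgbench_branches has one row per scale unit

total_accounts = scale * accounts_per_scale_unit
print(total_accounts)              # 100000000000 (100 billion rows)
print(total_accounts // branches)  # 100000 accounts per branch (bid)
```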

The command line is straightforward for 2,000 users per pgbench session, running for ten minutes. PGHOST and PGLOADBALANCEHOSTS are set in the environment:

pgbench -f ./limitless_pgbench_update.sql -c 2000 -j 2000 -s 1000000 -n -P 60 -T 600

The following is the complete run on each of the three driver instances:

Driver instance 1:

$ pgbench -f ./limitless_pgbench_update.sql -c 2000 -j 2000 -s 1000000 -n -P 60 -T 600
pgbench (16.4)
progress: 60.0 s, 686401.9 tps, lat 2.357 ms stddev 0.639, 0 failed
progress: 120.0 s, 849197.9 tps, lat 2.355 ms stddev 0.639, 0 failed
progress: 180.0 s, 848990.5 tps, lat 2.356 ms stddev 0.660, 0 failed
progress: 240.0 s, 849730.9 tps, lat 2.354 ms stddev 0.649, 0 failed
progress: 300.0 s, 836900.8 tps, lat 2.390 ms stddev 2.488, 0 failed
progress: 360.0 s, 847194.7 tps, lat 2.361 ms stddev 1.471, 0 failed
progress: 420.0 s, 848091.6 tps, lat 2.358 ms stddev 0.651, 0 failed
progress: 480.0 s, 846166.1 tps, lat 2.364 ms stddev 1.596, 0 failed
progress: 540.0 s, 850282.2 tps, lat 2.352 ms stddev 0.655, 0 failed
progress: Infinity s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
transaction type: ./limitless_pgbench_update.sql
scaling factor: 1000000
query mode: simple
number of clients: 2000
number of threads: 2000
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 499780303
number of failed transactions: 0 (0.000%)
latency average = 2.355 ms
latency stddev = 1.228 ms
initial connection time = 11526.511 ms
tps = 848475.083080 (without initial connection time)

Driver instance 2:

$: pgbench -f ./limitless_pgbench_update.sql -c 2000 -j 2000 -s 1000000 -n -P 60 -T 600
pgbench (16.4)
progress: 60.0 s, 732174.1 tps, lat 2.206 ms stddev 0.701, 0 failed
progress: 120.0 s, 897809.3 tps, lat 2.228 ms stddev 0.675, 0 failed
progress: 180.0 s, 896915.5 tps, lat 2.230 ms stddev 0.686, 0 failed
progress: 240.0 s, 898754.1 tps, lat 2.225 ms stddev 0.674, 0 failed
progress: 300.0 s, 883947.6 tps, lat 2.262 ms stddev 2.486, 0 failed
progress: 360.0 s, 893172.1 tps, lat 2.239 ms stddev 1.469, 0 failed
progress: 420.0 s, 895048.2 tps, lat 2.234 ms stddev 0.674, 0 failed
progress: 480.0 s, 892966.3 tps, lat 2.240 ms stddev 1.598, 0 failed
progress: 540.0 s, 897730.2 tps, lat 2.228 ms stddev 0.681, 0 failed
progress: Infinity s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
transaction type: ./limitless_pgbench_update.sql
scaling factor: 1000000
query mode: simple
number of clients: 2000
number of threads: 2000
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 527475230
number of failed transactions: 0 (0.000%)
latency average = 2.231 ms
latency stddev = 1.211 ms
initial connection time = 11600.855 ms
tps = 896127.475507 (without initial connection time)

Driver instance 3:

$: pgbench -f ./limitless_pgbench_update.sql -c 2000 -j 2000 -s 1000000 -n -P 60 -T 600
pgbench (16.4)
progress: 60.0 s, 585688.0 tps, lat 2.451 ms stddev 0.611, 0 failed
progress: 120.0 s, 811500.4 tps, lat 2.464 ms stddev 0.612, 0 failed
progress: 180.0 s, 811900.0 tps, lat 2.463 ms stddev 0.627, 0 failed
progress: 240.0 s, 813461.0 tps, lat 2.458 ms stddev 0.619, 0 failed
progress: 300.0 s, 801854.7 tps, lat 2.494 ms stddev 2.531, 0 failed
progress: 360.0 s, 810100.2 tps, lat 2.469 ms stddev 1.505, 0 failed
progress: 420.0 s, 812459.2 tps, lat 2.461 ms stddev 0.621, 0 failed
progress: 480.0 s, 809339.0 tps, lat 2.471 ms stddev 1.564, 0 failed
progress: 540.0 s, 813195.7 tps, lat 2.459 ms stddev 0.635, 0 failed
transaction type: ./limitless_pgbench_update.sql
scaling factor: 1000000
query mode: simple
number of clients: 2000
number of threads: 2000
maximum number of tries: 1
duration: 600 s
number of transactions actually processed: 472903957
number of failed transactions: 0 (0.000%)
latency average = 2.465 ms
latency stddev = 1.185 ms
initial connection time = 17012.864 ms
tps = 811034.386711 (without initial connection time)

Summing the transactions per second across the three drivers results in 2,555,636.95, or more than 2.5 million TPS, at a latency of about 2.35 ms with a latency standard deviation of less than half the latency.
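The total is simply the sum of the three per-driver tps figures reported above:

```python
# Per-driver tps (without initial connection time), from the three summaries above.
driver_tps = [848475.083080, 896127.475507, 811034.386711]
total = sum(driver_tps)
print(round(total, 2))  # 2555636.95, i.e., more than 2.5 million TPS
```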

Summary of pgbench

In this section, we explored how to benchmark Aurora Limitless using pgbench, a tool originally optimized for single-node PostgreSQL databases. We demonstrated that at each step of the process—from table creation to data loading and query execution—adjustments are necessary to effectively use the distributed architecture of Aurora Limitless.

We covered the following key points:

  • The importance of sharding tables and choosing the appropriate shard keys in Aurora Limitless
  • How to create and load sharded tables for pgbench in Aurora Limitless
  • The need to adapt data loading strategies, moving from pgbench’s multi-row INSERT statements to a more distributed approach using multiple connections
  • The concept of table collocation and its significance in a sharded environment
  • How to create indexes and primary keys in a sharded setup

Although pgbench is highly optimized for single-node systems, Aurora Limitless offers new opportunities for scalability and performance. However, to fully utilize these benefits, both schemas and workloads must be adapted to suit a sharded database architecture.

Troubleshooting

If your statement isn’t being pushed down and run on a single shard, determine whether the statement touches only rows with a particular shard key value (and possibly reference tables). Consider the following:

  • Are all tables in the statement sharded?
  • Is the table a reference table?
  • Does the statement include an equality predicate on the shard key?
  • Are there tables with different shard keys in the same statement?
  • Are there non-distributed functions involved?

Conclusion

Aurora Limitless enables you to create a sharded solution with high performance and scalability. Because it is a sharded solution, you must consider and implement sharding as well as changes to your workloads.

In this post, we used the common tool pgbench to show how tools built for single-node solutions might not perform as well as they could on a multi-node solution. We considered the following:

  • Sharded schema – We deliberately selected shard keys and created tables with sharding and collocation.
  • Reference tables – For a table that is mostly read-only with little or infrequent DML, a reference table is the right choice. This is an opportunity to make sure statements are pushed down to a shard.
  • Data load – An initial load is typically done with the data load utility. Single-node optimized loads are slower on multi-node systems, and parallel sessions are better than a few heavyweight sessions.
  • Indexing – To be unique, an index must include the shard key. A table’s primary key definition (and index) therefore typically appends the shard key.
  • Workload – The workload includes an equality predicate on the shard key to make sure most of the workload is pushed down to shards. In addition, function signatures include the shard key and are configured to be distributed.

A well-sharded system should include the schema and workload changes we discussed in this post. These workloads can scale to the configured maximum ACU and provide performance well over what single-node systems can offer. Aurora offers operational simplicity, enabling you to focus on the business value.


About the author

Stefan Karlsson is a seasoned professional with more than three decades of experience, including two decades in DBMS performance engineering. His focus is, and has been, on customers, partners, and performance. In his spare time, he enjoys cooking and beating his daughter in video games.