AWS Database Blog
Amazon Aurora PostgreSQL parameters, Part 3: Optimizer parameters
Many organizations today plan to migrate away from traditional commercial databases, and they don't want to compromise on performance, availability, or security as they do so. Amazon Aurora is a cloud-native relational database service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. The PostgreSQL-compatible edition of Aurora delivers up to 3X the throughput of standard PostgreSQL running on the same hardware, enabling existing PostgreSQL applications and tools to run without requiring modification. The combination of PostgreSQL compatibility with Aurora enterprise database capabilities provides an ideal target for commercial database migrations.
Amazon Aurora PostgreSQL-Compatible Edition has engine-level enhancements that improve performance for highly concurrent OLTP (online transaction processing) workloads, which typically facilitate and manage transaction-oriented applications, and that help bridge the feature gap between commercial and open-source engines. While the default parameter settings for Aurora PostgreSQL are good for most workloads, customers who migrate their workloads from commercial engines may need to tune some parameters according to performance and other non-functional requirements. Even for workloads migrated from Amazon RDS for PostgreSQL to Aurora PostgreSQL, you may need to revisit some parameter settings because of architectural differences and engine-level optimizations.
In this four-part series, we explain parameters specific to Aurora PostgreSQL. We also delve into certain PostgreSQL database parameters that apply to Aurora PostgreSQL, how they behave differently, and how to set them to leverage or control additional features in Aurora PostgreSQL.
In part one of this series, we discussed the instance memory-related parameters and Query Plan Management parameters that can help you tune Amazon Aurora PostgreSQL. In part two, we discussed parameters related to replication, security, and logging. In this part, we cover Aurora PostgreSQL optimizer parameters that can improve query performance. In part four, we will cover parameters that can align Aurora PostgreSQL more closely with American National Standards Institute (ANSI) standards and reduce the effort of migrating from commercial engines.
Query Optimizer
PostgreSQL uses a cost-based optimizer that works well in most cases. You can review the cost of the different steps of a run plan by using the EXPLAIN command. EXPLAIN returns the query plan chosen by the optimizer, without running the query. The optimizer uses statistics collected about the table's data, metadata, and resource constraints to arrive at an optimal query plan.
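For example, basic usage looks like the following (the queries here are illustrative ones against the pgbench tables):

```sql
-- Show the plan the optimizer chose, without executing the query
EXPLAIN
SELECT bid, count(*)
FROM pgbench_accounts
GROUP BY bid;

-- EXPLAIN ANALYZE additionally runs the query and reports actual row counts and timings
EXPLAIN ANALYZE
SELECT count(*) FROM pgbench_branches;
```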
Certain metadata about table constraints isn't factored in by the PostgreSQL optimizer; NOT NULL and FOREIGN KEY constraints are examples of this. If the optimizer factored in these constraints during the plan generation phase, it could produce more efficient plans.
Amazon Aurora PostgreSQL introduced these enhancements in version 2.4.0 (compatible with PostgreSQL 10.11) and 3.1.0 (compatible with PostgreSQL 11.6). All subsequent minor version releases for Aurora PostgreSQL 2.x and 3.x have these optimizations. These additional query optimizer features are also available in newer major version releases such as Aurora PostgreSQL 4.x (compatible with PostgreSQL 12).
In this post, we discuss how these optimizations work by first going through the default PostgreSQL behavior and then the effect of the enhancements on the same plans on the Aurora platform. You can control all these optimizations with parameters, which allow you to revert to the default PostgreSQL behavior if you need to.
You can enable these parameters in the DB cluster parameter group or DB parameter group. Any change to these parameters is dynamic and doesn't require a restart of the Aurora PostgreSQL DB instance. You can also change these parameters for a specific session, which is useful for testing their effect.
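For example, you can toggle one of these parameters for the current session only (shown here with one of the parameters discussed below):

```sql
-- Applies only to the current session; no instance restart required
SET apg_enable_not_in_transform = on;

-- Verify the current value
SHOW apg_enable_not_in_transform;
```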
If you're new to PostgreSQL or databases, we highly recommend that you run the commands and examples. This can help build familiarity with reading and understanding output from the EXPLAIN command. For this purpose, we have used the four tables of the pgbench schema, which can be easily recreated, along with sample data, by the pgbench tool.
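A representative initialization command looks like the following (host, user, and database names are placeholders, and the exact invocation used for the original setup may differ):

```sh
# -i initializes the pgbench tables, -s sets the scale factor, and
# --foreign-keys creates the referential integrity constraints used later in this post
pgbench -i -s 50000 --foreign-keys -h <cluster-endpoint> -U <user> <dbname>
```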
To show the impact of some of the optimizations present in Aurora PostgreSQL, we used a scale factor of 50,000, meaning pgbench creates data for 50,000 branches. This is roughly 700 GB of data. You can choose a smaller scale factor based on your instance size.
apg_enable_not_in_transform
This parameter improves the query plan by rewriting a NOT IN clause to an anti join. This anti join optimization is similar to how an IN clause is converted to EXISTS or JOIN during query optimization. Converting NOT IN to NOT EXISTS is complex because of the possible nullability of the columns involved in the NOT IN. Consider a query of the following shape.
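A representative reconstruction looks like this (illustrative; the exact query in the original benchmark may differ):

```sql
-- Branches that have no matching accounts
SELECT *
FROM pgbench_branches
WHERE bid NOT IN (SELECT bid FROM pgbench_accounts);
```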
If bid in pgbench_accounts has null values, the result of this query is very different from what we get if we express the same requirement using the NOT EXISTS clause.
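The NOT EXISTS version of the same requirement, again as a representative sketch:

```sql
SELECT *
FROM pgbench_branches b
WHERE NOT EXISTS (
  SELECT 1 FROM pgbench_accounts a WHERE a.bid = b.bid
);
```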
If the list for IN or NOT IN contains a NULL, the comparison can evaluate to NULL instead of true or false, so rows you might expect aren't part of your output. In contrast, when using NOT EXISTS, all rows except the matching rows are returned (hence the term anti join).
But what if we can ensure that the bid column in pgbench_accounts doesn't contain any null values? We can do so by deleting all null values and adding a NOT NULL constraint.
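The cleanup and constraint can be applied with statements like these (illustrative):

```sql
-- Remove any rows with a null bid, then disallow nulls going forward
DELETE FROM pgbench_accounts WHERE bid IS NULL;
ALTER TABLE pgbench_accounts ALTER COLUMN bid SET NOT NULL;
```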
Now both queries return the same result.
After setting the column to NOT NULL and adding an index, the query plan changes.
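You can reproduce the comparison with statements like the following (the index definition is inferred from the description; the plan output and timings depend on your data and instance):

```sql
CREATE INDEX ON pgbench_accounts (bid);

EXPLAIN ANALYZE
SELECT * FROM pgbench_branches
WHERE bid NOT IN (SELECT bid FROM pgbench_accounts);

EXPLAIN ANALYZE
SELECT * FROM pgbench_branches b
WHERE NOT EXISTS (
  SELECT 1 FROM pgbench_accounts a WHERE a.bid = b.bid
);
```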
The Execution Time reported in the EXPLAIN output for both queries indicates that NOT IN runs for more than 1.5 hours, whereas NOT EXISTS starts fetching the same result set after running for less than 12 milliseconds, an improvement of 560,000 times over NOT IN.
Now let's see how Aurora PostgreSQL rewrites the NOT IN query and generates an optimal plan, using the index to perform a faster anti join, when apg_enable_not_in_transform is on.
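A session-level way to observe this, using the illustrative query from above:

```sql
SET apg_enable_not_in_transform = on;

EXPLAIN
SELECT *
FROM pgbench_branches
WHERE bid NOT IN (SELECT bid FROM pgbench_accounts);
-- With the transform enabled, the plan shows an anti join
-- that can use the index on pgbench_accounts (bid)
```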
OLTP systems that also run some form of batch reporting or a light analytic workload, where such queries are common, can gain noticeable performance improvements from this feature. We recommend rigorous testing before enabling this parameter at the instance level, or choose the session-based approach as demonstrated.
apg_enable_semijoin_push_down
This parameter in Aurora PostgreSQL enables the use of semi join filters for hash joins. Based on selectivity estimates, the optimizer builds a Bloom filter from the values in the inner table of the hash join. For a hash join, the table with the lower cardinality is chosen for building the hash. Once the hash is built, column values from the outer (typically bigger) table are used to evaluate the hash condition and generate the matching rows for the join output.
When apg_enable_semijoin_push_down is enabled, the optimizer avoids a second pass over rows in the result set that can't satisfy the join condition. To do this, a Bloom filter is applied during the sequential scan that builds the outer result set. This can slow down the sequential scan a bit, but the Bloom filter reduces the result set that needs to be checked against the hash condition in subsequent stages.
For example, a join between distinct branches in pgbench_accounts and those rows in pgbench_tellers where tbalance is greater than a certain amount results in a hash join. First, let's look at the explain plan when this parameter is not enabled.
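A query of this shape reproduces the scenario (the column list and the tbalance threshold here are illustrative):

```sql
SET apg_enable_semijoin_push_down = off;

EXPLAIN
SELECT DISTINCT a.bid
FROM pgbench_accounts a
JOIN pgbench_tellers t ON a.bid = t.bid
WHERE t.tbalance > 100;
```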
If we allow Aurora PostgreSQL to apply a semi join filter when the result set from pgbench_tellers is being built, the subsequent join step becomes simpler. With the parameter enabled, the explain plan shows that Aurora PostgreSQL can apply a Bloom filter (with an estimated selectivity of 0.42117), which reduces the number of rows that need to be evaluated for the hash condition.
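To compare, enable the pushdown for the session and rerun the same EXPLAIN:

```sql
SET apg_enable_semijoin_push_down = on;
-- Rerun the EXPLAIN from the previous example; the plan now shows
-- the semi join (Bloom) filter applied during the sequential scan
```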
Even though the sequential scan runs slightly slower because of the semi join pushdown, overall query performance is better (less than 49 minutes as compared to a little over 1 hour when the semi join pushdown isn’t enabled). The optimization achieved depends on the selectivity of the inner result set.
Once again, although this is an instance-level parameter, it could directly affect quite a few of the SQL queries in your workload. Session-based usage lets you scope it appropriately.
apg_enable_remove_redundant_inner_joins
This parameter improves query performance by removing a join operation that doesn't impact the final result of the query (the query result remains the same irrespective of whether the join is run). An example use case is a join between a child and a parent table using a foreign key.
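A representative query of this kind (reconstructed for illustration):

```sql
SELECT t.tid, t.tbalance
FROM pgbench_tellers t
JOIN pgbench_branches b ON t.bid = b.bid;
```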
If we look at the time spent in the different steps of the query plan, we can see that most of the time is spent running the join. The overall run of the query takes 173.109 milliseconds, and the hash join itself takes 155.068 milliseconds to return all the rows from that step.
Now let's look at the table structure for pgbench_tellers.
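When the schema is initialized with foreign keys, pgbench_tellers has approximately the following structure (this is a sketch of the psql \d output, not a verbatim copy):

```sql
\d pgbench_tellers
-- Column   | Type          |
-- tid      | integer       | not null (primary key)
-- bid      | integer       |
-- tbalance | integer       |
-- filler   | character(84) |
-- Foreign-key constraints:
--   "pgbench_tellers_bid_fkey" FOREIGN KEY (bid) REFERENCES pgbench_branches(bid)
```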
In the preceding code, pgbench_tellers is a child table, and the FOREIGN KEY constraint makes sure that no row in pgbench_tellers has a value that doesn't exist in the bid column of pgbench_branches. So in this case, we can avoid the join by using simpler SQL.
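For example (illustrative; rows with a null bid would be excluded by the join, so we filter them out explicitly):

```sql
SELECT tid, tbalance
FROM pgbench_tellers
WHERE bid IS NOT NULL;
```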
Most users are aware of these SQL optimization techniques and tune their queries accordingly. Unlike the previous two parameters, this parameter can catch a rogue or ill-performing unoptimized query and avoid the additional I/O. For instance, in some cases, application teams or owners of different services might abstract the underlying implementation and allow other consumers only view-based access. For example, we might define a view that restricts direct access to certain specific columns.
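A view of this shape fits the description (the view name here is a placeholder):

```sql
CREATE VIEW teller_balances AS
SELECT t.tid, t.tbalance
FROM pgbench_tellers t
JOIN pgbench_branches b ON t.bid = b.bid;
```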
Now let's see the default behavior of PostgreSQL (by disabling the parameter apg_enable_remove_redundant_inner_joins).
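Using the illustrative view from above:

```sql
SET apg_enable_remove_redundant_inner_joins = off;

EXPLAIN
SELECT tid, tbalance FROM teller_balances;
-- The plan still contains the hash join between
-- pgbench_tellers and pgbench_branches
```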
When we query a view, the view name is replaced with the view's defining query as it is. As you learned in our previous example, the join is performed even though we select columns from only one table. A join might seem necessary to apply the filter, such as selecting only those rows from pgbench_tellers that exist in pgbench_branches. But if we review the table structure of pgbench_tellers, it becomes clear that pgbench_tellers only has rows that have a corresponding bid in pgbench_branches (due to the referential integrity constraint). It's preferable for the optimizer to consider this information (which can be fetched from the catalog tables).
If we enable apg_enable_remove_redundant_inner_joins, the optimizer starts considering foreign key relationships to ascertain whether a table can be skipped from an INNER JOIN.
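Rerunning the same EXPLAIN with the parameter enabled:

```sql
SET apg_enable_remove_redundant_inner_joins = on;

EXPLAIN
SELECT tid, tbalance FROM teller_balances;
-- The join is removed; the plan scans only pgbench_tellers,
-- with an added filter: bid IS NOT NULL
```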
The optimizer now gets rows only from pgbench_tellers. Interestingly, the optimizer has introduced an additional filter here that isn't part of the view definition: bid IS NOT NULL. That's because the column bid in pgbench_tellers allows null values, and the join criteria would have filtered out those rows.
Both the run time and the planning time are reduced. That's because the optimizer no longer needs to check statistics for pgbench_branches and avoids evaluating multiple join options.
Conclusion
In this post, we covered additional SQL optimizations that Aurora PostgreSQL can use to improve query performance. Aurora uses parameters to control these optimizations, which can be set at the Aurora cluster, DB instance, or session level.
For more information about Aurora PostgreSQL parameters, see part one and part two of this series. In part four, we discuss ANSI compatibility options that can reduce the code changes needed when migrating applications from commercial databases.
About the authors
Sameer Kumar is a Database Specialist Technical Account Manager at Amazon Web Services. He focuses on Amazon RDS, Amazon Aurora and Amazon DocumentDB. He works with enterprise customers providing technical assistance on database operational performance and sharing database best practices.
Gopalakrishnan Subramanian is a Database Specialist Solutions Architect at Amazon Web Services. He works with our customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.