Perform maintenance tasks and schema modifications in Amazon RDS for PostgreSQL with minimal downtime using Blue Green deployment

In this post, we walk you through performing schema changes and common maintenance tasks such as table and index reorganization, VACUUM FULL, and materialized view refreshes with minimal downtime using blue/green deployments for an Amazon Relational Database (Amazon RDS) for PostgreSQL database or an Amazon Aurora PostgreSQL-Compatible Edition cluster.

Solution overview

Amazon RDS blue/green deployment enables you to perform changes in a staging environment that are isolated from the production system. It first creates a staging environment (green) that is a topological copy of your current production environment (blue). The blue and green environments stay in sync using logical replication. This feature is available on Amazon RDS for PostgreSQL version 11.21 onwards.

The blue/green environment is ready for performing maintenance activities when there is minimal or no replication lag between the source database (blue) and staging database (green). You can perform the maintenance activities in the green environment, while the application continues to use the blue database. It’s highly recommended you schedule any maintenance activities at a low-activity time for your application. This will avoid the higher replica lag, and in turn improve the overall performance of the test.

With blue/green deployment, you can address the following use cases with minimal downtime or no impact on the source environment:

Gather statistics using ANALYZE
Perform VACUUM FULL on catalog tables
Adjust the fill factor for a table
Rebuild indexes for optimization
Add new columns with and without default values for non-serial data types
Expand the column size in a table
Refresh materialized views
Enable the required extensions in shared_preload_libraries

Now, we will delve into the scenarios where you can use blue/green deployment to implement database changes with minimal downtime. Refer to limitations for blue/green deployments to ensure compatibility for using this feature.

Prerequisites

Complete the following steps to set up the test environment:

Create an RDS for PostgreSQL instance or an Amazon Aurora PostgreSQL cluster if you don’t already have one. For instructions, refer to Create a PostgreSQL DB instance or Create an Amazon PostgreSQL DB cluster, respectively.
Create a new cluster parameter group with rds.logical_replication as 1 to have WAL translate to logical operations and max_logical_replication_workers as 10 (there is one logical replication slot for each customer database along with the rdsadmin database).
Associate the cluster parameter group to newly created RDS instance. For more information, refer to New – Fully managed Blue/Green Deployment in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL.
Create an Amazon Elastic Compute Cloud (Amazon EC2) instance to install the PostgreSQL client to access the RDS for PostgreSQL instance. For instructions, refer to Create your EC2 resources and launch your EC2 instance. Or you can set up connectivity between your RDS database and EC2 compute instance in one click.
Install the PostgreSQL client. On Amazon Linux 2023, you can use the following commands to download the psql command line tool:
```
sudo dnf install postgresql15.x86_64
```

Create a sample database for testing on the source database:

postgres=> create database analyze_test_db;
CREATE DATABASE

We use pgbench, a simple PostgreSQL benchmarking tool, for our test case. Run the pgbench command with the insert-only workload option to create four tables and ingest test data:

[ec2-user@ip-10-0-32-224 ~]$ pgbench -h ******.*******.us-east-1.rds.amazonaws.com -p 5432 -U benchmark_dba -i -s 150 analyze_test_db;
Password:
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
15000000 of 15000000 tuples (100%) done (elapsed 63.97 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 111.76 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 65.00 s, vacuum 9.49 s, primary keys 37.25 s)

Create the RDS blue/green environment. Refer to New – Fully managed Blue/Green Deployment in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL for the prerequisites and instructions to set up the blue/green environment.
When the blue/green environment is ready and replication lag is minimal, establish a connection to the green environment using the psql client. The default setting for the green environment is read-only mode. To enable write operations in the green environment, run the command to set the default_transaction_read_only parameter to off at the session level. This allows write operations in the green environment.
```
analyze_test_db=> SET default_transaction_read_only TO off;
SET
```
Note: You can run VACUUM and ANALYZE operations in the green environment without enabling ‘write’ at session level.

Verify the database session is in read/write mode:
```
analyze_test_db=> SHOW default_transaction_read_only;
default_transaction_read_only
-------------------------------
off
(1 row)
```

Please ensure to continually monitor the replication lag between the blue and green environment as follows:

Monitor replication lag on green environment

When making write operation changes to the green environment, it’s important to monitor the replication slots from the blue environment. A replication lag will continue to increase in case of any logical replication issues. If this happens, the green environment will not consume WAL and the disk usage will increase resulting in the database to hang.
Following query checks the gap from the blue environment using log sequence number (lsn).

analyze_test_db=> SELECT slot_name,confirmed_flush_lsn,pg_current_wal_lsn(),(pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM pg_replication_slots;
slot_name                        | confirmed_flush_lsn | pg_current_wal_lsn   | lsn_distance
--------------------------------------------------------+---------------------+--------------------+--------------
rds_us_east_1_ojyxjuttnhucw4q7vdg5vbo5fq_bg_slot_16435 | F/7C422EC0          | F/832607E0         |    115595552
(1 row)

Monitor the RDS for PostgreSQL log in the green environment:

analyze_test_db=> SELECT slot_name,confirmed_flush_lsn,pg_current_wal_lsn(),(pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM pg_replication_slots;
slot_name                        | confirmed_flush_lsn | pg_current_wal_lsn | lsn_distance
--------------------------------------------------------+---------------------+--------------------+--------------
rds_us_east_1_eroldf42vugb3zhgtzqo6x4nni_bg_slot_14717 | A/C4000A10          | A/C4000A10         |            0
rds_us_east_1_eroldf42vugb3zhgtzqo6x4nni_bg_slot_16433 | A/C4000A10          | A/C4000A10         |            0
rds_us_east_1_eroldf42vugb3zhgtzqo6x4nni_bg_slot_16434 | A/BC0008D0          | A/C4000A10         |    134218048
rds_us_east_1_eroldf42vugb3zhgtzqo6x4nni_bg_slot_16435 | A/C4000A10          | A/C4000A10         |            0
rds_us_east_1_eroldf42vugb3zhgtzqo6x4nni_bg_slot_16436 | A/C4000A10          | A/C4000A10         |            0
rds_us_east_1_eroldf42vugb3zhgtzqo6x4nni_bg_slot_16384 | A/C4000A10          | A/C4000A10         |            0
(6 rows)

Use Cases

With the test environment set up for blue/green deployment, let’s dive into practical code examples showcasing various maintenance tasks achievable through Amazon RDS blue/green deployments.

Gather statistics using ANALYZE

The Major Version Upgrade (MVU) of Amazon RDS for PostgreSQL database employs the pg_upgrade module. However, this module doesn’t transfer optimizer statistics from the prior version, which can hamper the performance of queries post-upgrade. Therefore, it’s crucial to run the ANALYZE command to regenerate these statistics for each database in the primary instance.

The blue/green deployment method is an effective strategy to mitigate the impact of this operation. It allows users to run the ANALYZE operation on the green (new) environment before transitioning the application workload to it.

However, it’s important to note that during a blue/green deployment setup for a major version upgrade, the optimizer statistics will not be available in the green environment. You can validate these table-level statistics by querying pg_stats:

analyze_test_db=> SELECT avg_width,n_distinct,correlation from pg_stats WHERE tablename='pgbench_accounts' AND attname='aid';
(0 rows)

If no statistics are found, they need to be updated. Although the green environment continues to be logically replicated from the blue environment, the ANALYZE operation is not logically replicated. Therefore, you must manually run the ANALYZE command in the green environment.

The following is the command to analyze the tables on the green environment:

analyze_test_db=> ANALYZE VERBOSE pgbench_accounts;
INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 30000 of 245902 pages, containing 1830000 live rows and 0 dead rows; 30000 rows in sample, 15000022 estimated total rows
ANALYZE
Time: 9043.929 ms (00:09.044)

This command returns a detailed analysis of the public.pgbench_accounts table, including the number of live rows, dead rows, rows in the sample, and estimated total number of rows.

Validate the statistics gathered by the ANALYZE command to confirm they’ve been updated correctly:

analyze_test_db=> SELECT avg_width,n_distinct,correlation from pg_stats WHERE tablename='pgbench_accounts' AND attname='aid';
-[ RECORD 1 ]---
avg_width   | 4
n_distinct  | -1
correlation | 1

Perform VACUUM FULL on catalog tables

In PostgreSQL, bloat refers to the excess space that is occupied by dead tuples, which are essentially remnants of old data that is no longer valid. This situation can occur due to frequent updates and deletions in your database tables.

When it comes to catalog table bloat, this can be primarily caused by high frequency of creating and dropping objects within your database. To remove this bloat and reduce the size of the table, it is necessary to perform a VACUUM FULL operation, which will require ACCESS EXCLUSIVE LOCK on the tables. This lock on catalog tables won’t allow any queries to the database.

Although tools like pg_repack can help reduce downtime when dealing with bloat on user tables by allowing us to perform a VACUUM FULL operation with locking the table for a shorter amount of time, this doesn’t work with catalog tables. For more information on pg_repack, see Remove bloat from Amazon Aurora and RDS for PostgreSQL with pg_repack.

However, in a blue/green deployment model, you can effectively manage this downtime. You can run the VACUUM FULL operation on the green environment while the blue environment continues to handle live traffic, causing minimal disruption to your database operations. The following is the command to run on the green environment:

analyze_test_db=> SET default_transaction_read_only TO off;
SET

analyze_test_db=> vacuum full pg_class;
VACUUM

Adjust the fill factor for a table

Adjusting the fill factor of a table in a database is a crucial operation that can have significant impact on performance. The fill factor is essentially the percentage of space on each page (the basic unit of storage in PostgreSQL) to be filled with data, reserving the rest for updates of that data. For example, a fill factor of 70 means PostgreSQL will try to leave 30% of every leaf page unfilled when it writes it out, so that subsequent updates of existing data can place the updated rows on the same page as the original. This can avoid the need for costly page splits.

For example, your business might have an application with a table with a high volume of updates. Initially, the table was created with a fill factor of 100, which means the pages are completely filled when data is inserted. Over time, as the statuses are updated, the database needs to find new pages to store the updated rows, leading to bloat and an inefficient use of space.

Since the database spends a lot of time looking for free space and moving rows around, these frequent updates cause performance issues. In these types of workloads, you can change the fill factor of this table to leave enough free space for updates.

However, changing the fill factor of a table needs an ACCESS EXCLUSIVE LOCK for a significant amount of time (depending on the size of the table). To avoid the downtime, you can run this operation on the green environment. Complete the following steps:

Set the session as read/write and then perform the ALTER:

analyze_test_db=> SET default_transaction_read_only TO off;
SET
analyze_test_db=> SHOW default_transaction_read_only;
default_transaction_read_only
-------------------------------
off
(1 row)

Change the FillFactor for the pgbench_tellers table:

analyze_test_db=> ALTER TABLE IF EXISTS public.pgbench_tellers SET (FILLFACTOR=10);
ALTER TABLE
analyze_test_db=> \d+ pgbench_tellers;
                                         Table "public.pgbench_tellers"
Column  |     Type      | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------+---------------+-----------+----------+---------+----------+-------------+--------------+-------------
tid      | integer       |           | not null |         | plain    |             |              |
bid      | integer       |           |          |         | plain    |             |              |
tbalance | integer       |           |          |         | plain    |             |              |
filler   | character(84) |           |          |         | extended |             |              |
Indexes:
"pgbench_tellers_pkey" PRIMARY KEY, btree (tid)
Access method: heap
Options: fillfactor=10

Reorganize the table to make fillfactor effective for all blocks in the table:

analyze_test_db=> vacuum full pgbench_tellers;
VACUUM
Query returned successfully in 3 min 13 secs.

Concurrent to the preceding operation, we can run a heavy DML workload activity on the table in the blue environment:

[ec2-user@ip-10-0-32-224 ~]$ pgbench --host=******.*******.us-east-1.rds.amazonaws.com analyze_test_db -U postgres -b simple-update -T 600
Password:
pgbench (15.4, server 14.9)
starting vacuum...end.

Due to the heavy workload on the table, we saw an increase in the lag (lsn_distance), as shown in the following code. However, it didn’t impact the replication. The table was altered successfully in the green environment without any issues reported. Also, the green environment was able to catch up on the lag from the blue environment.

analyze_test_db=> SELECT slot_name,confirmed_flush_lsn,pg_current_wal_lsn(),(pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM pg_replication_slots;
slot_name                        | confirmed_flush_lsn | pg_current_wal_lsn | lsn_distance
--------------------------------------------------------+---------------------+--------------------+--------------
rds_us_east_1_ojyxjuttnhucw4q7vdg5vbo5fq_bg_slot_16435 | F/809D2708          | F/859F3B88         |     84022400
(1 row)

Rebuild indexes for optimization

Reindexing is an essential operation that is typically run when indexes become corrupt or bloated. PostgreSQL v12 offers a feature that allows the recreation of indexes using the CONCURRENT option. This feature enables the rebuilding of indexes while minimizing the locking of writes, which is a significant improvement for high availability tables where even a brief lock might not be acceptable.

Using the blue/green deployment model can be particularly beneficial in such cases. By building indexes with the CONCURRENT option in the green environment, you can minimize performance impacts on the application and prevent locking issues.

To confirm that the database is in read/write mode, run the following commands:

analyze_test_db=> SET default_transaction_read_only TO off;
SET
index_test_db=> SHOW default_transaction_read_only;
default_transaction_read_only
-------------------------------
off
(1 row)

After the validation, perform the REINDEX operation on the index:

analyze_test_db=> REINDEX INDEX CONCURRENTLY pgbench_accounts_pkey;
REINDEX

In our test, we observed it was noted that the green environment experienced high IO:DataFileRead waits, as shown in the following screenshot. This is expected due to the concurrent reading and rebuilding of the index. Importantly, no locking or other waits were observed due to the brief nature of the cutover facilitated by the CONCURRENT option.

Add new columns with and without default values for non-serial data types

In this section, we demonstrate how to add a column to a table in the green environment. The ALTER TABLE operation supports adding a column with or without a default value. However, the addition of a serial data type (auto incremented columns) is not supported. It will break the logical replication.

After you alter the table by adding a column (with a default value and with no default value), the replication works fine. It can be a useful feature for dynamic environments where it is required to add a column in the table without impacting the application availability.

analyze_test_db=> ALTER TABLE DDL_TEST ADD COLUMN a2 int DEFAULT 10;
ALTER TABLE
Query returned successfully in 166 msec.
analyze_test_db=> ALTER TABLE DDL_TEST ADD COLUMN a5 int;
ALTER TABLE
Query returned successfully in 60 msec.

Blue/green deployment does not support the addition of auto-incremented columns (serial data type). As shown in the following logs from the green environment, the serial data type cannot set the nextval(), but can add an int column with no issues:

2023-12-07 23:45:13 UTC::@:[685]:ERROR: cannot execute nextval() in a read-only transaction
2023-12-07 23:45:13 UTC::@:[395]:LOG: background worker "logical replication worker" (PID 685) exited with exit code 1
2023-12-07 23:45:18 UTC::@:[688]:LOG: logical replication apply worker for subscription "rds_us_east_1_4pblch2ievqphi4emjgvmxl3ie_bg_sub_16434" has started
2023-12-07 23:45:18 UTC::@:[688]:ERROR: cannot execute nextval() in a read-only transaction
2023-12-07 23:45:18 UTC::@:[395]:LOG: background worker "logical replication worker" (PID 688) exited with exit code 1

Expanding column size in a table

In database management, it’s possible to expand data types based on your requirements. For example, you can scale from an integer (int) to a large integer type like bigint, or even further to numeric. Similarly, the character data type (char(n)) can be expanded to varchar or text.

However, such expansions require a certain amount of downtime. The duration of this downtime is directly proportional to the size of the table you’re altering. Larger tables will demand a longer downtime.

To minimize the impact of this downtime, you can expand the column in the green environment.

On the green environment, expand the id column from INT to BIGINT:

analyze_test_db=> SET default_transaction_read_only to off;
SET
Time: 46.450 ms
analyze_test_db=> ALTER TABLE test_column_expand ALTER COLUMN id TYPE bigint;
ALTER TABLE
Analyze_test_db=> \d test_column_expand
Table "public.test_column_expand"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
id | bigint | | |
name | character varying | | |

Create another session on the database and from that session, check the lock on the table:

analyze_test_db=> select relation::regclass, mode from pg_locks where relation='test_column_expand'::regclass;
relation | mode
--------------------+---------------------
test_column_expand | AccessExclusiveLock
(1 row)

Check the replication by updating a row on the blue environment:

On the blue environment run:

analyze_test_db=> select * from test_column_expand where id=1;
id | name
----+-------
1 | test1
(1 row)
analyze_test_db=> update test_column_expand set name='modified_column_value' where id=1;
UPDATE 1
Time: 6705.917 ms (00:06.706)
analyze_test_db=> select * from test_column_expand where id=1;
id | name
----+-----------------------
1 | modified_column_value
(1 row)

On the green environment run:

analyze_test_db=> select * from test_column_expand where id=1;
id | name
----+-----------------------
1 | modified_column_value
(1 row)

Review the metrics in Amazon RDS Performance Insights of the blue environment to confirm that there was no locking due to the column expand on the green environment.

Refresh materialized views

A current limitation of logical replication is that a materialized view refresh is not replicated because there is no WAL for this operation. Alternatively, you can perform a refresh manually on the green just prior to switchover to have up-to-date views in preparation for the green environment switch.

Let’s set up the environment:

Create a materialized view on the green environment:

analyze_test_db=> CREATE MATERIALIZED VIEW mv_pgbench_join_tables
AS SELECT t1.aid FROM pgbench_accounts t1
INNER JOIN pgbench_branches t2 ON t1.bid = t2.bid
INNER JOIN pgbench_history t3 ON t2.bid = t3.bid WITH DATA;

Perform an INSERT operation on the underlying tables of the materialized view in the blue environment:

analyze_test_db=> INSERT INTO public.accounts2 SELECT generate_series(50000,50000000),1,0,'somestring';
INSERT 0 1

Refresh the materialized view in the green environment:

analyze_test_db=> REFRESH MATERIALIZED VIEW mv_pgbench_join_tables;
REFRESH MATERIALIZED VIEW

During creation as well as during the refresh of the materialized view, there isn’t lag on the logical replication with data being actively inserted on the underlying tables:

index_test_db=> SELECT slot_name,confirmed_flush_lsn,pg_current_wal_lsn(),(pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM pg_replication_slots;
slot_name | confirmed_flush_lsn | pg_current_wal_lsn | lsn_distance
--------------------------------------------------------+---------------------+--------------------+--------------
rds_us_east_1_ojyxjuttnhucw4q7vdg5vbo5fq_bg_slot_16436 | F/A0001FA0 | F/A0001FA0 | 0
(1 row)

You can benefit from this because you can create and refresh your materialized views on the green environment.

analyze_test_db=> select schemaname as schema_name,matviewname as view_name,matviewowner as owner,ispopulated as is_populated,definition
from pg_matviews order by schema_name,view_name;
schema_name | view_name | owner | is_populated | definition
-------------+------------------------+----------+--------------+--------------------------------------------------------
public | mv_pgbench_join_tables | postgres | t | SELECT t1.aid FROM accounts2 t1 INNER JOIN accounts3 t2 ON t1.bid = t2.bid
| | | |
| | | |
(1 row)

After the switchover, the materialized view is available for use in the green environment. A pre-switchover refresh of a materialized view can improve user experience and downtime.

analyze_test_db=> EXPLAIN ANALYZE SELECT * FROM mv_pgbench_join_tables LIMIT 1000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.07 rows=5 width=4) (actual time=1.107..1.110 rows=1000 loops=1)
-> Seq Scan on mv_pgbench_join_tables (cost=0.00..216371.13 rows=14999913 width=4) (actual time=1.104..1.106 rows=5 loops=1)
Planning Time: 10.904 ms
Execution Time: 1.137 ms
(4 rows)

Enable required extensions in shared_preload_libraries

To successfully enable required extensions, an important preliminary step entails updating the shared_preload_libraries parameter. This move is crucial because it sets the stage for the rest of the process, laying the groundwork for the necessary changes to be implemented smoothly. However, this change necessitates a system restart.

In an ideal situation where you have a green environment set up, this process becomes even more streamlined. The parameter can be effectively changed, and the system can be restarted in the green environment. This method prevents any disruption to the ongoing operations in the blue environment. When the green environment is up and running smoothly with the updated parameter, the application can then be strategically pointed from the blue environment to the green. This careful, step-by-step approach results in minimal disruptions while maintaining optimal operational efficiency.

To update this parameter on the green environment, you need to modify the value in the cluster parameter group and restart the instance.

Limitations

The standard PostgreSQL logical replication limitations apply to blue/green deployments. For more information, refer to Limitations for blue/green deployments.

Clean up

To clean up after switchover, delete the database instances in the blue environment. If you created a new RDS for PostgreSQL instance to follow along with this post, delete the blue/green deployment object and green environment as well. For instructions, see Deleting a DB instance.

Conclusion

In this post, we showed you a method to other than manual operations that can’t be logically replicated, the green environment is well-suited for maintenance operations that could potentially disrupt the performance of a production database. This includes altering a table’s fill factor, removing table bloat using VACUUM FULL (or pg_repack), and removing index bloat using REINDEX CONCURRENTLY.

This post demonstrated a method to implement schema changes in Amazon RDS for PostgreSQL with minimal impact using blue/green deployments. You can apply the same approach to Amazon Aurora PostgreSQL.

We welcome your feedback; leave your comments or questions in the comments section.

About the authors

Anita Singh is a Principal Specialist Solutions Architect with over 20 years of experience, architecting solutions focused around Database technologies. She guides customers with designing and building highly scalable, available and performant applications. She actively engages and advices customers on their database migration and modernization journey to or within the AWS cloud infrastructure.

Baji Shaik is a Sr. Lead Consultant with AWS Professional Services, Global Competency Centre. His background spans a wide depth and breadth of expertise and experience in SQL and NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on premises to Amazon RDS and Amazon Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration,” “Beginning PostgreSQL on the Cloud,” and “PostgreSQL Development Essentials.”

Sarabjeet Singh is a Database Specialist Solutions Architect at AWS. He works with customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

Nathan Ballance is a Cloud Support Engineer (Databases) at Amazon Web Services, specializing in providing customized solutions for AWS customers. He provides troubleshooting support and collaborates with AWS service teams to resolve complex customer issues, addressing a wide range of technical challenges.