AWS Database Blog
Building fast refresh capability in Amazon RDS for PostgreSQL
This post is a guest post by CDL. In their own words, CDL is a “leading insurtech company based in the UK, featured in the Financial Times Future 100 listing of high-growth UK companies making an impact on their industry. It has a strong track record in the insurance and financial services sectors, with its solutions powering some of the UK’s most profitable insurance retailers. Transacting over 7 million policies on its systems, CDL’s clients include Sainsbury’s Bank, Tesco Bank, Swinton Insurance, and Moneysupermarket.com.”
This post describes how CDL developed fast refresh capability using materialized view logs in Amazon RDS for PostgreSQL. We detail what we built to track changes and offer an alternative to the complete refresh, which slashed the time needed from hours to seconds. We also share open-source software to enable fast refresh with the wider PostgreSQL community and outline the installation process involved.
The challenge
CDL handles millions of transactions each day. We were looking to migrate our business intelligence (BI) platform from Oracle to PostgreSQL. To make the switch viable, we needed the relational database to handle this volume of change and refresh in near-real time so that customers retain access to up-to-the-minute business intelligence.
PostgreSQL natively supports only a complete refresh. However, our service-level agreements require data to be refreshed every 15 minutes, and the volume of change transacted by CDL meant that the complete refresh process couldn't handle the size of our materialized views (MVs) within this timescale. Our largest MV is over 150 GB, which took hours to build through the complete refresh process, and there can be over 150 view refreshes per day, depending on the customer.
We use the MVs on our BI solution to essentially de-normalize the data from our online transaction processing (OLTP) layer into logical data marts for our clients to consume and use to perform data analytics. Fast refresh capability was therefore an essential prerequisite for CDL when we switched from Oracle to PostgreSQL.
Fast refresh vs. complete refresh
Fast refresh uses materialized view logs on the underlying tables to keep track of changes, and only the changes since the last refresh are applied to the MV. In contrast, the complete refresh process recomputes all the data, which can take hours for large datasets.
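For context, the only option PostgreSQL offers out of the box is a full recompute of the view (big_mv below is a placeholder name):

```sql
-- Native PostgreSQL recomputes the entire view on every refresh
REFRESH MATERIALIZED VIEW big_mv;

-- CONCURRENTLY lets readers keep querying the view while it refreshes
-- (it requires a unique index on the MV), but it still recomputes the
-- whole dataset, so it is no faster for large views
REFRESH MATERIALIZED VIEW CONCURRENTLY big_mv;
```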
The following diagram illustrates how we use MVs in our solution: we sync our data from the OLTP layer into an RDS instance using Attunity Replicate. We then build the MVs in the RDS instance from the OLTP source tables and refresh them every 15 minutes using scheduled refresh jobs.
Developing the process for PostgreSQL
The fast refresh process was designed to be installed into its own schema that contains the functions needed to run the MV process, with three data dictionary tables and one role.
The workflow for the MV log creation in the following diagram is as follows:
- Add the UUID column to the source table
- Create the MV log for the source table
- Add indexes on the MV log table
- Create the trigger on the source table
- Insert the entry into the pgmview_logs table
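To make that workflow concrete, here is a minimal conceptual sketch of the kind of objects the module generates for a source table t1. The names and column layout are illustrative, not the module's exact DDL, and gen_random_uuid() assumes PostgreSQL 13+ or the pgcrypto extension:

```sql
-- Illustrative only: the module generates equivalents of these objects.
ALTER TABLE testpocsource.t1
    ADD COLUMN m_row$ uuid DEFAULT gen_random_uuid();  -- UUID row identifier

CREATE TABLE testpoc.log$_t1 (
    m_row$   uuid        NOT NULL,             -- identifies the changed row
    dmltype$ char(1)     NOT NULL,             -- 'I', 'U', or 'D'
    created  timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX log$_t1_m_row$ ON testpoc.log$_t1 (m_row$);

CREATE FUNCTION testpoc.trig$_t1() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO testpoc.log$_t1 (m_row$, dmltype$)
        VALUES (OLD.m_row$, 'D');
        RETURN OLD;
    END IF;
    INSERT INTO testpoc.log$_t1 (m_row$, dmltype$)
    VALUES (NEW.m_row$, left(TG_OP, 1));        -- 'I' or 'U'
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig$_t1
    AFTER INSERT OR UPDATE OR DELETE ON testpocsource.t1
    FOR EACH ROW EXECUTE FUNCTION testpoc.trig$_t1();
```

A fast refresh then only has to read log$_t1 and apply those rows to the MV, instead of re-running the full view query.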
The workflow for the MV creation shown in the following diagram is as follows:
- Deconstruct the SQL statement
- Extract the compound MV tables
- Create the base table from MV SQL
- Add the rowid column to the source table
- Add an index to the rowid column
- Insert into the pgmviews table
- Insert into the pgmviews_details table
- Run a complete refresh on the MV
There are four main functions a user runs:
- mv$createMaterializedViewlog – Creates the MV logs
- mv$createMaterializedView – Creates the MV
- mv$removeMaterializedViewLog – Removes the MV logs
- mv$removeMaterializedView – Removes the MV
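Removal mirrors creation. The calls below use the schemas from this post's test case and the mv_fast_test view created in the test harness section later in this post; the positional signatures are assumptions based on the repo's examples:

```sql
-- Remove the MV first, then the logs on its source table
SELECT testpoc.mv$removeMaterializedView('mv_fast_test', 'testpocmv');
SELECT testpoc.mv$removeMaterializedViewLog('t1', 'testpocsource');
```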
Source code
Download the source code from the GitHub repo:
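For example (the URL is the repo this post links to; adjust if it has moved):

```bash
git clone https://github.com/CDLSoftware/pg-mv-fast-refresh.git
cd pg-mv-fast-refresh
```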
Installing the fast refresh functions
The fast refresh functions are designed to live in their own schema in the database, which is specified via the MODULEOWNER parameter.
To install the MV code, complete the following steps:
- Navigate to the folder where you downloaded the repo and edit the module_set_variables.sh file. This file stores all the variables for where you want to install the fast refresh functions. You don't need the SOURCEUSERNAME/SOURCEPASSWORD and MVUSERNAME/MVPASSWORD parameters to install the fast refresh functions; they are used for the test harness setup. As an example of the parameter settings used in the test case, you have an RDS instance pg-tonytest.citnv5igau2v.eu-west-1.rds.amazonaws.com with a database testpoc and a master username dbamin. The install package installs the fast refresh functions under the schema testpoc. You also have a source schema testpocsource, where the source data tables for the test harness go, and a schema testpocmv, where the MV is built. A sketch of these settings appears after this list.
- Change the permissions on the runCreateFastRefreshModule.sh script to make it executable, and then run it (see the commands after this list). It should take only seconds to run.
- When it's complete, check the log file in the location you set. For this post, the log file is in /tmp. The status is at the bottom of the file.
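The original post's listings aren't reproduced here, so the following is an illustrative sketch of the install steps above. The variable names are assumptions modeled on the test case; treat the names in the repo's module_set_variables.sh as authoritative:

```bash
# Excerpt of module_set_variables.sh -- values from this post's test case,
# variable names assumed; check the file in the repo for the full list.
export DBHOST=pg-tonytest.citnv5igau2v.eu-west-1.rds.amazonaws.com
export DBNAME=testpoc
export DBUSERNAME=dbamin             # master username
export MODULEOWNER=testpoc           # schema that owns the fast refresh functions
export SOURCEUSERNAME=testpocsource  # test harness source schema (optional here)
export MVUSERNAME=testpocmv          # test harness MV schema (optional here)
export LOG_PATH=/tmp                 # where the install log is written

# Make the install script executable and run it (takes only seconds):
chmod +x runCreateFastRefreshModule.sh
./runCreateFastRefreshModule.sh

# Check the status at the bottom of the newest log file:
tail "$(ls -t "$LOG_PATH"/*.log | head -1)"
```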
Seeing the functions in the database
If you connect to the database on which you installed the package, you should see the functions under the MODULEOWNER schema. For this post, the functions are under testpoc.
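You can verify this from psql with a standard catalog query (the schema name is the testpoc example from this post):

```sql
-- List the installed fast refresh functions
SELECT p.proname
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'testpoc'
  AND p.proname LIKE 'mv$%'
ORDER BY p.proname;
```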
Installing the test harness
The following use case demonstrates how to create a materialized view (MV). You create six tables and insert some data into them. All the following steps are in the create_test_harness.sh script and use the variables set in module_set_variables.sh; a condensed sketch of the SQL involved follows the list.
- Create a schema for the test data
- Create a schema for the MV
- Create the test data as the sourceusername user
- Create the MV logs
- Create the MV
- Select data from the MV
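The full listings live in create_test_harness.sh; the sketch below shows the shape of what it runs. The schemas match this post, but the table definition is a cut-down stand-in (the script creates six tables, t1 through t6, plus users and grants), and the mv$ parameter names are assumptions based on the repo's examples:

```sql
-- Steps 1-2: schemas for the test data and the MV
CREATE SCHEMA testpocsource;
CREATE SCHEMA testpocmv;

-- Step 3: cut-down test data; the real script builds six tables, t1..t6
CREATE TABLE testpocsource.t1 (id int PRIMARY KEY, code text);
INSERT INTO testpocsource.t1 VALUES (1, 'hello');

-- Step 4: create the MV log on each source table
SELECT testpoc.mv$createMaterializedViewlog('t1', 'testpocsource');

-- Step 5: create the fast refresh MV
SELECT testpoc.mv$createMaterializedView(
    pViewName        => 'mv_fast_test',
    pSelectStatement => 'SELECT t1.id, t1.code FROM testpocsource.t1 t1',
    pOwner           => 'testpocmv',
    pFastRefresh     => TRUE);

-- Step 6: select data from the MV
SELECT * FROM testpocmv.mv_fast_test;
```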
Updating the test harness
This is a simple test to show how the fast refresh process works. For this post, you update a row in the base table t1 and see how that change is reflected in the MV after a fast refresh. The test script is update_mv_test.sh; a sketch of the statements it runs follows the list.
- Update a row in table t1 with UPDATE testpocsource.t1 set code='yo' where code='hello'
- Check that the data in the MV hasn't changed (because the MV hasn't been refreshed yet); the row is still set to hello
- Do a fast refresh on the MV and check it again; the row has changed
- Change the row back to hello by updating the t1 table and refreshing the view again
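A sketch of what update_mv_test.sh runs; the UPDATE statement is taken verbatim from the post, while the refresh function name and parameters are assumptions based on the repo:

```sql
-- Update a row in the base table
UPDATE testpocsource.t1 SET code='yo' WHERE code='hello';

-- Before a refresh, the MV still shows 'hello'
SELECT code FROM testpocmv.mv_fast_test;

-- Fast refresh applies only the logged change (function name assumed
-- from the repo; check its README for the exact signature)
SELECT testpoc.mv$refreshMaterializedView(
    pViewName    => 'mv_fast_test',
    pOwner       => 'testpocmv',
    pFastRefresh => TRUE);

-- After the refresh, the MV shows 'yo'
SELECT code FROM testpocmv.mv_fast_test;

-- Put the row back and refresh again
UPDATE testpocsource.t1 SET code='hello' WHERE code='yo';
SELECT testpoc.mv$refreshMaterializedView(
    pViewName    => 'mv_fast_test',
    pOwner       => 'testpocmv',
    pFastRefresh => TRUE);
```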
Removing the test harness
To remove the test harness, run the drop_test_harness.sh script, which removes the data and MV schemas from the database:
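A minimal invocation, assuming you run it from the repo folder with the same variable file:

```bash
chmod +x drop_test_harness.sh
./drop_test_harness.sh   # drops the test data and MV schemas
```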
Removing the fast refresh functions
To remove the fast refresh functions from the database, run the dropFastRefreshModule.sh script, which prompts you to enter Y to proceed with the removal of the objects:
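Again assuming the repo folder and variable file:

```bash
chmod +x dropFastRefreshModule.sh
./dropFastRefreshModule.sh   # enter Y at the prompt to confirm removal
```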
Conclusion
This post reviewed the fast refresh module and provided a simple example of getting an MV up and running. To enable our customers to access real-time business intelligence, we had to find a unique solution. We hope you enjoyed learning about how to develop fast refresh capabilities using MV logs with PostgreSQL. You can download the code from the GitHub repo.
Stay tuned for part two of this series, where we’ll discuss the module and results we have seen in running this solution in the production environment.
About the Author
Tony Mullen is a Principal Data Engineer at CDL. He has over 18 years' experience with databases, primarily Oracle, with a shift toward PostgreSQL in recent years. He heads up the migration of key applications from on-premises Oracle to Amazon RDS for PostgreSQL and has a keen interest in building the PostgreSQL community in Manchester, co-hosting the PostgreSQL Manchester meetup.