AWS Big Data Blog
Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query
You may have heard the saying that the best ETL is no ETL. Amazon Redshift now makes this possible with Federated Query. In its initial release, this feature lets you query data in Amazon Aurora PostgreSQL or Amazon RDS for PostgreSQL using Amazon Redshift external schemas. Federated Query also exposes the metadata from these source databases through system views and driver APIs, which allows business intelligence tools like Tableau and Amazon Quicksight to connect to Amazon Redshift and query data in PostgreSQL without having to make local copies. This enables a new data warehouse pattern—live data query—in which you can seamlessly retrieve data from PostgreSQL databases, or build data into a late binding view, which combines operational PostgreSQL data, analytical Amazon Redshift local data, and historical Amazon Redshift Spectrum data in an Amazon S3 data lake.
Simplified ETL use case
For this ETL use case, you can simplify the familiar upsert pattern with a federated query. You can bypass the need for incremental extracts in Amazon S3 and the subsequent load via COPY
by querying the data in place within its source database. This change can be a single line of code that replaces the COPY
command with a query to an external table. See the following code:
In the preceding example, the table pg.store_sales
resides in PostgreSQL, and you use a federated query to retrieve fresh data to load into a staging table in Amazon Redshift, keeping the actual delete and insert operations unchanged. This pattern is likely the most common application of federated queries.
Setting up an external schema
The external schema pg
in the preceding example was set up as follows:
If you’re familiar with the CREATE EXTERNAL SCHEMA command from using it in Spectrum, note some new parameter options to enable federated queries.
Whereas Amazon Redshift Spectrum references an external data catalog that resides within AWS Glue, Amazon Athena, or Hive, this code points to a Postgres catalog. Also, expect more keywords used with FROM
, as Amazon Redshift supports more source databases for federated querying. By default, if you do not specify SCHEMA
, it defaults to public
.
Within the target database, you identify DATABASE ‘dev’
and SCHEMA ‘retail’
, so any queries to the Amazon Redshift table pg.<some_table>
get issued to PostgreSQL as a request for retail.<some_table>
in the dev
database. For Amazon Redshift, query predicates are pushed down and run entirely in PostgreSQL, which reduces the result set returned to Amazon Redshift for subsequent operations. Going further, the query planner derives cardinality estimates for external tables to optimize joins between Amazon Redshift and PostgreSQL. From the preceding example:
The URI
and PORT
parameters that reference both the PostgreSQL endpoint and port are self-explanatory, but there are a few things to consider in your configuration:
- Use a read replica endpoint in Aurora or Amazon RDS for PostgreSQL to reduce load on the primary instance.
- Set up your Amazon RDS for PostgreSQL instance, Aurora serverless or provisioned instances, and Amazon Redshift clusters to use the same VPC and subnet groups. That way, you can add the security group for the cluster to the inbound rules of the security group for the Aurora or Amazon RDS for PostgreSQL instance.
- If both Amazon Redshift and Aurora or Amazon RDS for PostgreSQL are on different VPCs, set up VPC peering. For more information, see What is VPC Peering?
Configuring AWS Secrets Manager for remote database credentials
To retrieve AWS Secrets Manager remote database credentials, our example uses the following code:
These two parameters are interrelated because the SECRET_ARN
is also embedded in the IAM policy for the role.
If a service like Secrets Manager didn’t exist and you wanted to issue a federated query from Amazon Redshift to PostgreSQL, you would need to supply the database credentials to the CREATE EXTERNAL SCHEMA
command via a parameter like CREDENTIALS
, which you also use with the COPY
command. However, this hardcoded approach doesn’t take into account that the PostgreSQL credentials could expire.
You avoid this problem by keeping PostgreSQL database credentials within Secrets Manager, which provides a centralized service to manage secrets. Because Amazon Redshift retrieves and uses these credentials, they are transient and not stored in any generated code and are discarded after query execution.
Storing credentials in Secrets Manager takes up to a few minutes. To store a new secret, complete the following steps:
- On the Secrets Manager console, choose Secrets.
- Choose Store a new secret.
- In the Store a new secret section, complete the following:
- Supply your PostgreSQL database credentials
- Name the secret; for example, MyRDSCredentials
- Configure rotation (you can enable this at a later time)
- Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post)
- Choose Next.
You can also retrieve the credentials easily.
- On the Secrets Manager console, choose your secret.
- Choose Retrieve secret value.
The following screenshot shows you the secret value details.
This secret is now an AWS resource referenced via a secret ARN. See the following screenshot.
Setting up an IAM role
You can now pull everything together by embedding the secret ARN into an IAM policy, naming the policy, and attaching it to an IAM role. See the following code:
The following screenshot shows the details of the IAM role called myFederatedQueryRDS
, which contains the MyRDSSecretPolicy policy. It’s the same role that’s supplied in the IAM_ROLE
parameter of the CREATE EXTERNAL SCHEMA
DDL.
Finally, attach the same IAM role to your Amazon Redshift cluster.
- On the Amazon Redshift console, choose your cluster.
- From the Actions drop-down menu, choose Manage IAM roles.
- Choose and add the IAM role you just created.
You have now completed the following steps:
- Create an IAM policy and role
- Store your PostgreSQL database credentials in Secrets Manager
- Create an Amazon Redshift external schema definition that uses the secret and IAM role to authenticate with a PostgreSQL endpoint
- Apply a mapping between an Amazon Redshift database and schema to a PostgreSQL database and schema so Amazon Redshift may issue queries to PostgreSQL tables.
You only need to complete this configuration one time.
Querying live operational data
This section explores another use case: querying operational data across multiple source databases. In this use case, a global online retailer has databases deployed by different teams across distinct geographies:
- Region
us-east-1
runs serverless Aurora PostgreSQL. - Region
us-west-1
runs provisioned Aurora PostgreSQL, which is also configured as a global database with a read replica inus-east-1
. - Region
eu-west-1
runs an Amazon RDS for PostgreSQL instance with a read replica inus-east-1
.
Serverless and provisioned Aurora PostgreSQL and Amazon RDS for PostgreSQL are visible in the Amazon RDS console in Region us-east-1
. See the following screenshot:
For this use case, assume that you configured the read replicas for Aurora and Amazon RDS to share the same VPC and subnets in us-east-1
with the local serverless Aurora PostgreSQL. Furthermore, you have already created secrets for each of these instances’ credentials, and also an IAM role MyCombinedRDSSecretPolicy
, which is more permissive and allows Amazon Redshift to retrieve the value of any Amazon RDS secret within any Region. Be mindful of security in actual production use, however, and explicitly specify the resource ARNs for each secret in separate statements in your IAM policy. See the following code:
External schema DDLs in Amazon Redshift can then reference the combined IAM role and individual secret ARNs. See the following code:
This late binding view abstracts the underlying queries to TPC-H lineitem test data within all PostgreSQL instances. See the following code:
Amazon Redshift can query live operational data across multiple distributed databases and aggregate results into a unified view with this feature. See the following code:
If you examine Remote PG Seq Scan
in the following query plan, you see that predicates are pushed down for execution in all three PostgreSQL databases. Unlike your initial simplified ETL use case, no ETL is performed because data is queried and filtered in place. See the following code:
Combining the data lake, data warehouse, and live operational data
In this next use case, you join Amazon Redshift Spectrum historical data with current data in Amazon Redshift and live data in PostgreSQL. You use a 3TB TPC-DS dataset and unload data from 1998 through 2001 from the store_sales
table in Amazon Redshift to Amazon S3. The unloaded files are stored in Parquet format with ss_sold_date_sk
as partitioning key.
To access this historical data via Amazon Redshift Spectrum, create an external table. See the following code:
The external spectrum schema is defined as the following:
Instead of an Amazon S3 read-only policy, the IAM role mySpectrumRole
contains both AmazonS3FullAccess
and AWSGlueConsoleFullAccess
policies, in which the former allows Amazon Redshift writes to Amazon S3. See the following code:
To make partitioned data visible, the ALTER TABLE ... ADD PARTITION
command needs to specify all partition values. For this use case, 2450816 through 2452275 correspond to dates 1998-01-02 through 2001-12-31, respectively. To generate these DDLs quickly, use the following code:
You can run the generated ALTER TABLE statements individually or as a batch to make partition data visible. See the following code:
The three combined sources in the following view consist of historical data in Amazon S3 for 1998 through 2001, current data local to Amazon Redshift for 2002, and live data for two months of 2003 in PostgreSQL. When you create this late binding view, you have to re-order Amazon Redshift Spectrum external table columns because the previous UNLOAD operation specifying ss_sold_date_sk
as partition key shifted that column’s order to last. See the following code:
You can now run a query on the view to aggregate date and join tables across the three sources. See the following code:
This following federated query ran on a two-node DC2.8XL cluster and took 1 minute and 17 seconds to join store sales in Amazon S3, PostgreSQL, and Amazon Redshift, with the date dimension table in Amazon Redshift, aggregating and sorting row counts by year:
The query plan shows these are full sequential scans running on the three source tables with the number of returned rows highlighted, totaling 8.2 billion. Because Amazon Redshift Spectrum does not generate statistics for external tables, you manually set the numRows
property to the row count for historical data in Amazon S3. See the following code:
You can join with another dimension table local to Amazon Redshift, this time the 30 million row customer
table, and filter by column c_birth_country
. See the following code:
Query performance hardly changed from the previous query. Because the query only scanned one column (ss_sold_date_sk
), it benefits from Parquet’s columnar structure for the historical data subquery. To put it another way, if the historical data is stored as CSV, all the data is scanned, which degrades performance significantly.
Additionally, the TPC-DS model does not store date values in the store_sales
fact table. Instead, a foreign key references the date_dim
table. If you plan on implementing something similar but frequently filter by a date column, consider adding that column into the fact table and have it as a sort key, and also adding a partitioning column in Amazon Redshift Spectrum. That way, Amazon Redshift can more efficiently skip blocks for local data and prune partitions for Amazon S3 data, in the latter, and also push filtering criteria down to Amazon Redshift Spectrum.
Conclusion
Applications of live data integration in real-world scenarios include data discovery, data preparation for machine learning, operational analytics, IoT telemetry analytics, fraud detection, and compliance and security audits. Whereas Amazon Redshift Spectrum extends the reach of Amazon Redshift into the AWS data lake, Federated Query extends its reach into operational databases and beyond.
For more information about data type differences between these databases, see Data Type Differences Between Amazon Redshift and Supported RDS PostgreSQL or Aurora PostgreSQL Databases. For more information about accessing federated data with Amazon Redshift, see Limitations and Considerations When Accessing Federated Data with Amazon Redshift, and to learn more about how the feature works watch this demo.
About the Authors
Tito Mijares is a Data Warehouse Specialist Solutions Architect at AWS. He helps AWS customers adopt and optimize their use of Amazon Redshift. Outside of AWS, he enjoys playing jazz guitar and working on audio recording and playback projects.
Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.
Niranjan Kamat is a software engineer on the Amazon Redshift query processing team. His focus of PhD research was in interactive querying over large databases. In Redshift, he has worked in different query processing areas such as query optimization, analyze command and statistics, and federated querying. In his spare time, he enjoys playing with his three year old daughter, practicing table tennis (was ranked in top 10 in Ohio, USATT rating 2143), and chess.
Sriram Krishnamurthy is a Software Development Manager for Amazon Redshift Query Processing team. He is passionate about Databases and has been working on Semi Structured Data Processing and SQL Compilation & Execution for over 15 years. In his free time, you can find him on the tennis court, often with his two young daughters in tow.