AWS Big Data Blog
Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift
November 2023: This post was reviewed and updated to reflect the general availability (GA) of Amazon Aurora MySQL zero-ETL integration with Amazon Redshift, including the latest enhancements.
Amazon Aurora zero-ETL integration with Amazon Redshift was announced at AWS re:Invent 2022 and is now generally available (GA) for Aurora MySQL 3.05.0 (compatible with MySQL 8.0.32) and higher versions in multiple AWS Regions. For more details, refer to the What's New post.
In this post, we provide step-by-step guidance on how to get started with near-real time operational analytics using this feature.
Challenges
Customers across industries today are looking to increase revenue and customer engagement by implementing near-real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases:
- Analyze the data in-place in the operational database (for example, read replicas, federated query, or analytics accelerators)
- Move the data to a data store optimized for running analytical queries such as a data warehouse
The zero-ETL integration is focused on simplifying the latter approach.
A common pattern for moving data from an operational database to an analytics data warehouse is extract, transform, and load (ETL), a process of combining data from multiple sources into a large, central repository (the data warehouse). ETL pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL pipelines can introduce long delays, leaving applications that depend on this data in the data warehouse with stale or missing information, which in turn leads to missed business opportunities.
For customers that need to run unified analytics across data from multiple operational databases, solutions that analyze data in-place may work well for accelerating queries on a single database, but such systems can't aggregate data from multiple operational databases.
Zero-ETL
At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. With Aurora zero-ETL integration with Amazon Redshift, you can bring together the transactional data of Aurora with the analytics capabilities of Amazon Redshift. It minimizes the work of building and managing custom ETL pipelines between Aurora and Amazon Redshift. Data engineers can now replicate data from multiple Aurora database clusters into the same or a new Amazon Redshift instance to derive holistic insights across many applications or partitions. Updates in Aurora are automatically and continuously propagated to Amazon Redshift so the data engineers have the most recent information in near-real time. Additionally, the entire system can be serverless and can dynamically scale up and down based on data volume, so there’s no infrastructure to manage.
When you create an Aurora zero-ETL integration with Amazon Redshift, you continue to pay for Aurora and Amazon Redshift usage with existing pricing (including data transfer). The Aurora zero-ETL integration with Amazon Redshift feature is available at no additional cost.
With Aurora zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing users to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. You can perform transaction processing on data in Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards.
The following diagram illustrates this architecture.
Solution overview
Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Aurora MySQL 3.05.0 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near-real time using a zero-ETL integration.
The integration is set up between Amazon Aurora MySQL-Compatible Edition 3.05.0 (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near-real time on the destination, which processes analytical queries.
You can use either the provisioned or serverless option for both Amazon Aurora MySQL-Compatible Edition as well as Amazon Redshift. For this illustration, we use a provisioned Aurora database and an Amazon Redshift Serverless data warehouse.
The following diagram illustrates the high-level architecture.
The following are the steps needed to set up zero-ETL integration. For complete getting started guides, refer to the following documentation links for Aurora and Amazon Redshift.
- Configure the Aurora MySQL source with a customized DB cluster parameter group.
- Configure the Amazon Redshift Serverless destination with the required resource policy for its namespace.
- Update the Redshift Serverless workgroup to enable case-sensitive identifiers.
- Configure the required permissions.
- Create the zero-ETL integration.
- Create a database from the integration in Amazon Redshift.
Configure the Aurora MySQL source with a customized DB cluster parameter group
To create an Aurora MySQL database, complete the following steps:
- On the Amazon RDS console, create a DB cluster parameter group called zero-etl-custom-pg.

Zero-ETL integrations require specific values for the Aurora DB cluster parameters that control binary logging (binlog). For example, enhanced binlog mode must be turned on (aurora_enhanced_binlog=1).
- Set the following binlog cluster parameter settings:
  - binlog_backup=0
  - binlog_replication_globaldb=0
  - binlog_format=ROW
  - aurora_enhanced_binlog=1
  - binlog_row_metadata=FULL
  - binlog_row_image=FULL
In addition, make sure that the binlog_transaction_compression parameter is not set to ON, and that the binlog_row_value_options parameter is not set to PARTIAL_JSON. By default, these parameters are not set. (An AWS CLI alternative for this parameter group setup is sketched after this procedure.)
- Choose Save changes.
- Choose Databases in the navigation pane, then choose Create database.
- For Available versions, choose Aurora MySQL 3.05.0 (or higher).
- For Templates, select Production.
- For DB cluster identifier, enter zero-etl-source-ams.
- Under Instance configuration, select Memory optimized classes and choose a suitable instance size (the default is db.r6g.2xlarge).
- Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg).
- Choose Create database.
Within a few minutes, the Aurora MySQL database spins up as the source for the zero-ETL integration.
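If you prefer to script the parameter group setup described earlier, the following is a minimal AWS CLI sketch; the group name matches this walkthrough, and aurora-mysql8.0 is the parameter group family for Aurora MySQL 3.x:

```bash
# Create the custom DB cluster parameter group
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-custom-pg \
  --db-parameter-group-family aurora-mysql8.0 \
  --description "Custom parameter group for zero-ETL integration"

# Apply the binlog settings required by zero-ETL integrations
# (static parameters take effect after a reboot)
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-custom-pg \
  --parameters \
    "ParameterName=binlog_backup,ParameterValue=0,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_replication_globaldb,ParameterValue=0,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=pending-reboot" \
    "ParameterName=aurora_enhanced_binlog,ParameterValue=1,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_row_metadata,ParameterValue=FULL,ApplyMethod=pending-reboot" \
    "ParameterName=binlog_row_image,ParameterValue=FULL,ApplyMethod=pending-reboot"
```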
Configure the Redshift Serverless destination
After you create your source DB cluster, you must create and configure a target data warehouse in Amazon Redshift. The data warehouse must meet the following requirements:
- Using an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) or Redshift Serverless
- Encrypted (if using a provisioned cluster)
For our use case, create a Redshift Serverless data warehouse by completing the following steps:
- On the Amazon Redshift console, choose Serverless dashboard in the navigation pane.
- Choose Create workgroup.
- For Workgroup name, enter zero-etl-target-rs-wg. Choose a base RPU capacity between 8 and 512, available in increments of 8.
- For Namespace, select Create a new namespace and enter zero-etl-target-rs-ns.
- Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab.
- Choose Add authorized principals.
- Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations in this namespace.
An account ID is stored as an ARN with the root user (for example, arn:aws:iam::{account-id}:root).
- Add an authorized integration source to the namespace and specify the ARN of the Aurora MySQL DB cluster that’s the data source for the zero-ETL integration.
- Choose Save changes.
You can get the ARN for the Aurora MySQL source on the Configuration tab as shown in the following screenshot.
Update the Redshift Serverless workgroup to enable case-sensitive identifiers
Aurora MySQL is case sensitive by default, whereas case sensitivity is disabled by default on all Amazon Redshift provisioned clusters and Redshift Serverless workgroups. For the integration to be successful, the case sensitivity parameter (enable_case_sensitive_identifier) must be enabled for the data warehouse.
To enable case sensitivity on a provisioned cluster, create a custom parameter group with the enable_case_sensitive_identifier parameter enabled, then associate the parameter group with the cluster.
For a Redshift Serverless workgroup, you must use the AWS Command Line Interface (AWS CLI) to run the update-workgroup action:
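The following is a minimal sketch, assuming the workgroup name used in this walkthrough:

```bash
aws redshift-serverless update-workgroup \
  --workgroup-name zero-etl-target-rs-wg \
  --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true
```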
You can use AWS CloudShell or another interface like Amazon Elastic Compute Cloud (Amazon EC2) with an AWS user configuration that can update the Redshift Serverless parameter group. The following screenshot illustrates how to run this on CloudShell.
The following screenshot shows how to run the update-workgroup command on Amazon EC2.
Configure required permissions
To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure the required permissions for users or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform the following actions:
- Create zero-ETL integrations for the source Aurora DB cluster.
- View and delete all zero-ETL integrations.
- Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Amazon Redshift data warehouse and this account is an authorized principal for that data warehouse. Also note that Amazon Redshift has a different ARN format for provisioned and serverless:
- Provisioned cluster – arn:aws:redshift:{region}:{account-id}:namespace:{namespace-uuid}
- Serverless – arn:aws:redshift-serverless:{region}:{account-id}:namespace/{namespace-uuid}
Complete the following steps to configure the permissions:
- On the IAM console, choose Policies in the navigation pane.
- Choose Create policy.
- Create a new policy called rds-integrations using the JSON policy document shown following these steps.
- Attach the policy you created to your IAM user or role permissions.
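The following is a sketch of that policy, reconstructed from the actions listed previously; substitute the {region}, {account-id}, and {namespace-uuid} placeholders with values for your environment:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "rds:CreateIntegration",
            "Resource": [
                "arn:aws:rds:{region}:{account-id}:cluster:zero-etl-source-ams",
                "arn:aws:rds:{region}:{account-id}:integration:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "rds:DescribeIntegrations",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "rds:DeleteIntegration",
            "Resource": "arn:aws:rds:{region}:{account-id}:integration:*"
        },
        {
            "Effect": "Allow",
            "Action": "redshift:CreateInboundIntegration",
            "Resource": "arn:aws:redshift-serverless:{region}:{account-id}:namespace/{namespace-uuid}"
        }
    ]
}
```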
Create the zero-ETL integration
To create the zero-ETL integration, complete the following steps:
- On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
- Choose Create zero-ETL integration.
- For Integration identifier, enter a name, for example zero-etl-demo.
- For Source database, choose Browse RDS databases and choose the source cluster zero-etl-source-ams.
- Under Target, for Amazon Redshift data warehouse, choose Browse Redshift data warehouses and choose the Redshift Serverless destination namespace (zero-etl-target-rs-ns). To use a target in another AWS account, choose Specify a different account, explained later in this section.
- Add tags and encryption, if applicable, and choose Next.
- Verify the integration name, source, target, and other settings. When confirmed, choose Create zero-ETL integration.
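If you prefer the AWS CLI, the following is a minimal sketch of the equivalent call; the ARNs are placeholders for your source cluster and target namespace:

```bash
aws rds create-integration \
  --integration-name zero-etl-demo \
  --source-arn arn:aws:rds:{region}:{account-id}:cluster:zero-etl-source-ams \
  --target-arn arn:aws:redshift-serverless:{region}:{account-id}:namespace/{namespace-uuid}
```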
You can choose the integration to view the details and monitor its progress. It takes a few minutes to change the status from Creating to Active. The time varies depending on the size of the dataset already available in the source.
To specify a target Amazon Redshift data warehouse that's in another AWS account, you must create a role that allows users in the current account to access resources in the target account. For more information, refer to Providing access to an IAM user in another AWS account that you own.
Create a role in the target account with the following permissions:
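A sketch of the permissions policy, based on the redshift:CreateInboundIntegration action described earlier in this post; the placeholders identify the target namespace:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "redshift:CreateInboundIntegration",
            "Resource": "arn:aws:redshift-serverless:{region}:{account-id}:namespace/{namespace-uuid}"
        }
    ]
}
```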
The role must have the following trust policy, which specifies the target account ID. You can do this by creating a role with a trusted entity as an AWS account ID in another account.
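A sketch of that trust policy; {account-id} is a placeholder for the AWS account that is allowed to assume the role:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::{account-id}:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```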
The following screenshot illustrates creating this on the IAM console.
Then, while creating the zero-ETL integration, for the Specify a different account option, choose the destination account ID and the name of the role you created to proceed further.
Create a database from the integration in Amazon Redshift
To create your database, complete the following steps:
- On the Redshift Serverless dashboard, navigate to the zero-etl-target-rs-ns namespace.
- Choose Query data to open Query Editor v2.
- Connect to the Redshift Serverless data warehouse by choosing Create connection.
- Obtain the integration_id from the svv_integration system table (see the first query following this list).
- Use the integration_id from the previous step to create a new database from the integration (see the second query following this list).
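A minimal sketch of both queries; the database name aurora_zeroetl is an example, and you substitute the integration_id value returned by the first query:

```sql
-- Find the integration ID for the zero-ETL integration
SELECT integration_id FROM svv_integration;

-- Create a destination database from the integration
-- (replace <integration_id> with the value returned above)
CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<integration_id>';
```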
The integration is now complete, and a full snapshot of the source is reflected as is in the destination. Ongoing changes will be synced in near-real time.
Analyze the near-real time transactional data
Now we can run analytics on TICKIT’s operational data.
Populate the source TICKIT data
To populate the source data, complete the following steps:
- Connect to your Aurora MySQL cluster and create a database/schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process:
You can use the script from the following HTML file to create the sample database demodb (using the tickit.db model) in Amazon Aurora MySQL-Compatible Edition.
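A minimal sketch of the database creation and the primary-key check; the information_schema query is illustrative and lists any tables in demodb that would block replication for lack of a primary key:

```sql
CREATE DATABASE demodb;

-- Zero-ETL integration requires a primary key on every replicated table;
-- this lists any demodb tables that are missing one
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'demodb'
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_type IS NULL;
```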
- Run the script to create the tickit.db model tables in the demodb database/schema:
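For illustration, the following sketch shows the shape of one such table; the column definitions are assumptions based on the TICKIT data model, and the actual script defines all the tables:

```sql
CREATE TABLE demodb.users (
    userid    INT NOT NULL PRIMARY KEY,  -- primary key is required for zero-ETL replication
    username  CHAR(8),
    firstname VARCHAR(30),
    lastname  VARCHAR(30),
    city      VARCHAR(30),
    state     CHAR(2),
    email     VARCHAR(100),
    phone     CHAR(14)
);
```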
- Load data from Amazon Simple Storage Service (Amazon S3), record the finish time for change data capture (CDC) validations at the destination, and observe how active the integration was.
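A minimal sketch of this step, assuming a hypothetical S3 bucket and prefix holding pipe-delimited TICKIT data files:

```sql
-- Load one of the TICKIT tables from Amazon S3 (Aurora MySQL syntax)
LOAD DATA FROM S3 's3://<your-bucket>/tickit/allusers_pipe.txt'
INTO TABLE demodb.users
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';

-- Record the load finish time for CDC validation at the destination
SELECT NOW();
```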
The following are common errors associated with load from Amazon S3:
- For the current version of the Aurora MySQL cluster, we need to set the aws_default_s3_role parameter in the DB cluster parameter group to the ARN of a role that has the necessary Amazon S3 access permissions.
- If you get an error for missing credentials (for example, Error 63985 (HY000): S3 API returned error: Missing Credentials: Cannot instantiate S3 Client), you probably haven't associated your IAM role with the cluster. In this case, add the intended IAM role to the source Aurora MySQL cluster.
Analyze the source TICKIT data in the destination
On the Redshift Serverless dashboard, open Query Editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity:
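The following is a minimal sketch: the row count should match the count on the source table once the seed load completes, and sys_integration_activity records completed integration runs:

```sql
-- Row count on the destination; compare with the same query on the source
SELECT COUNT(*) FROM demodb.users;

-- Completed integration runs (seed and CDC activity)
SELECT * FROM sys_integration_activity;
```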
Choose the cluster or workgroup and the database created from the integration on the drop-down menu, and run the tickit.db sample analytic queries.
Monitoring
You can query the following system views and tables in Amazon Redshift to get information about your Aurora zero-ETL integrations with Amazon Redshift:
- SVV_INTEGRATION – Provides configuration details for your integrations.
- SYS_INTEGRATION_ACTIVITY – Provides information about completed integration runs.
- SVV_INTEGRATION_TABLE_STATE – Describes the table-level integration information.
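For example, a quick status check using these views (a sketch; SELECT * keeps it independent of specific column names):

```sql
-- Integration configuration and state
SELECT * FROM svv_integration;

-- Per-table replication status
SELECT * FROM svv_integration_table_state;
```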
To view the integration-related metrics published to Amazon CloudWatch, navigate to the Amazon Redshift console. Choose Zero-ETL integrations in the navigation pane and choose the integration links to display activity metrics.
The metrics available on the Amazon Redshift console are integration metrics and table statistics. Table statistics provide details of each table replicated from Aurora MySQL to Amazon Redshift, and integration metrics contain table replication success/failure counts and lag details.
Clean up
When you delete a zero-ETL integration, your transactional data isn’t deleted from Aurora or Amazon Redshift, but Aurora doesn’t send new data to Amazon Redshift.
To delete a zero-ETL integration, complete the following steps:
- On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
- Select the zero-ETL integration that you want to delete and choose Delete.
- To confirm the deletion, choose Delete.
Conclusion
In this post, we showed you how to set up Aurora zero-ETL integration from Amazon Aurora MySQL-Compatible Edition to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near-real time analytics on transactional and operational data.
To learn more about Aurora zero-ETL integration with Amazon Redshift, visit the documentation for Aurora and Amazon Redshift.
About the Authors
Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, Texas. He has 17 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with utmost security and data governance.
Vijay Karumajji is a Database Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.
BP Yau is a Sr Partner Solutions Architect at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.
Jyoti Aggarwal is a Product Management lead for Amazon Redshift zero-ETL. She leads the product and go-to-market strategy, including driving initiatives around performance, customer experience, and observability. She brings expertise in cloud compute and storage, data warehousing, and B2B/B2C customer experience.
Adam Levin is a Product Manager on the Amazon Aurora team based in California. He has spent the last 10 years working on various cloud database services.