AWS Partner Network (APN) Blog
How Matillion Multiplies the Performance of Complex ETL Jobs with Amazon Redshift Materialized Views
By Lee Power, Product Owner at Matillion
By Dilip Rajan, Partner Solution Architect at AWS
In modern business environments and data-driven organizations, decisions are rarely made without insights. These decisions are based on analytical dashboards that provide a point-in-time view of a specific business vertical. Powering these dashboards requires building and maintaining data pipelines with complex business logic.
Amazon Redshift recently announced support for materialized views, which lead to significantly faster query performance on repeatable query workloads. That, in turn, reduces the time to deliver the datasets you need to produce your business insights.
Matillion ETL for Amazon Redshift provides comprehensive enterprise-grade features to simplify and speed up building and maintaining these pipelines. Matillion is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Ready designation.
By using Matillion ETL with the new materialized views in Amazon RedShift, you can improve the performance of an extract, transform, and load (ETL) job and simplify your data pipeline. We found that job runtimes were consistently 9.75 x faster when using materialized views than when using standard views.
In this post, we’ll show you how to get those results.
Matillion ETL for Amazon Redshift
Matillion ETL uses orchestration jobs to handle data using pre-built connectors for software-as-a-service (SaaS) applications, NoSQL, files, on-premises and cloud databases, as well as from any RESTful API source system.
Once the orchestration job is set up, Matillion ETL first loads and then transforms the data to make it consumable by analytics tools such as Amazon Quicksight, Looker, Tableau, Power BI, and others.
Matillion ETL transforms the data in the same way, regardless of source, by creating stream batches to a staging file in Amazon Simple Storage Service (Amazon S3), and then using the Amazon Redshift copy
command to load the data.
You can launch Matillion ETL for Amazon Redshift either as an Amazon Machine Image (AMI), or by fitting it into your AWS CloudFormation template, which is also available through AWS Quick Starts.
We recommend you launch your Amazon Redshift clusters in the same virtual private cloud (VPC) or region as the Matillion AMI on Amazon Elastic Compute Cloud (Amazon EC2), as shown in Figure 1.
Figure 1 – Matillion ETL for Amazon Redshift architecture.
Detailed setup instructions are available with AWS CloudFormation templates on the Matillion site.
Amazon Redshift Materialized Views
Amazon Redshift materialized views contain precomputed results sets that have been queried from one or more tables. Before materialized views, you would create a temporary table using CTAS (CREATE TABLE AS SELECT)
.
However, as the underlying tables get updated with INSERTS
, UPDATES
, DELETES
, or COPY
from Amazon S3 options, the temporary table would get stale, and you would need to recreate the temporary table to keep the data fresh.
Once you create a materialized view, to get the latest data, you only need to refresh the view. Materialized views refresh much faster than updating a temporary table because of their incremental nature. Amazon Redshift uses only the new data to update the materialized view; it does not update the entire table.
To ensure materialized views are updated with the latest changes, you must refresh the materialized view before executing an ETL script. To automate this process, you can add this REFRESH
command as a part of your ETL script’s initialization:
How to Deploy Matillion ETL with Materialized Views
Let’s begin with the Create View component within a transformation job in the Matillion environment. This component lets you output a view definition to an Amazon Redshift cluster. In some circumstances, this action may be preferable to writing the data to a physical table.
To get started, drag an Input Table component from the Components Panel onto the canvas. Since Matillion ETL is running in your cloud environment, it can read your available tables, which you can easily select from a drop-down.
Now that you have a table, you can drag the Create View component onto the canvas and connect it to the Input Table component.
Figure 2 – Connect Input Table to Create View Component.
You can now configure your component using the Properties pane.
Figure 3 – Configure component properties.
When configuring a component, be sure to set the value for these properties:
- Name – Enter a descriptive name for the component.
. - Schema – Select the table schema. It’s used by the special value
Environment Default
. You can also use multiple schemas.
. - View Name – Provide a name for the view to be created.
. - Late Binding – Select ‘Yes’ to create a late-binding view; the default setting is ‘No.’ A late-binding view does not check underlying database objects, such as tables and other view, until the view is queried. As a result, you can alter or drop the underlying objects without dropping and recreating the view. If you drop underlying objects, queries to the late-binding view fail. If the query to the late-binding view references columns in the underlying object that aren’t present, the query fails.
. - View Type – Select ‘Standard’ or ‘Materialized.’ Replace ‘Standard View’ with ‘Materialized View’ when results aren’t likely to change frequently, and the view takes a long time to compute.
Figure 4 – Select View Type.
When to Use Materialized Views
Since in a materialized view data is pre-computed, querying it is faster than executing the original query. Use materialized views when:
- Query results contain a small number of rows and/or columns relative to the base table.
- Query results contain results that require significant processing.
Within an orchestration job, you can refresh a materialized view by moving the Refresh Materialized View component onto the canvas.
Figure 5 – Drag Refresh Materialized View component into an orchestration job.
Configuring Your Component
You can now configure your component using the Properties pane.
Figure 6 – Configure Refresh Materialized Views properties.
Fill out these fields:
- Name – Enter a descriptive name for the component.
. - Schema – Select the table schema. The special value
Environment Default
uses the schema defined in the environment.
. - View Name – Select one of the existing materialized views. Each materialized view has an owner. You can only select and refresh a view that you own.
Running the job with the configured properties performs a full refresh by re-running the underlying SQL statement, replacing all of the data in the materialized view.
Results
To determine the performance gains when using materialized view over standard view, we set up multiple test cases. For each case, we ran the same job but switched between standard and materialized view.
Test Procedure
You can do the same by following these steps.
- Drag the Input Table component onto the canvas.
. - Provide the required Table Name and Column Names.
. - Validate the job.
.
Note that Matillion ETL allows you to validate jobs as you go. This way, you can gain confidence that, when you run and commit resources, the job will succeed.
. - Drag a Create View component onto the canvas.
. - Set the component parameter to Standard View or Materialized View.
. - Select the Create View component, which is set to Standard.
. - Sample the data.
.
Note that many components allow you to produce a sample output of data up to a specified number of rows. You can use this capability to confirm that a component is set up correctly before using it in a live job.
. - The result appears in the Tasks menu, along with the runtime, as shown in Figure 7.
. - Next, select the Create View component, which is set to Materialized.
. - Re-sample the data.
The result appears in the Tasks menu, along with the runtime.
Figure 7 – Test results comparison.
We found that job runtimes were consistently 9.75 x faster when using materialized views than when using standard views.
Conclusion
Matillion ETL for Amazon Redshift simplifies and improves the performance of your ETL workloads for Amazon Redshift, reducing the time to deliver crucial datasets to operationalize analytics. By using materialized views, you can further improve that performance and simplify your data pipeline.
As an AWS Service Ready partner for Amazon RedShift, Matillion continues to innovate with Amazon Redshift, adopting new features such as shared jobs (pause and resume), and will be rolling out other features soon.
You can get more insight into releases on the Matillion ETL blog or in the Matillion ETL community.
New to Matillion ETL? Check out the free trial on AWS Marketplace.
.
.
Matillion – APN Partner Spotlight
Matillion is an AWS Competency Partner that delivers modern, cloud-native data integration technology designed to solve top business challenges.
Contact Matillion | Solution Overview | AWS Marketplace
*Already worked with Matillion? Rate the Partner
*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.