AWS Partner Network (APN) Blog
How Qlik Simplifies Analytics Data Collaboration Through Amazon Redshift Data Sharing
By Sathisan Vannadil, Sr. Partner Solutions Architect – AWS
John Park, Principal Solutions Architect – Qlik
Qlik |
Data is growing at an exponential rate, but as businesses become more interconnected and interdependent the need to share data to remain competitive is more important than ever.
Implementing an effective data sharing strategy that satisfies compliance and regulatory requirements, and data protection standards with clear data ownership, has become more complex.
In this post, we discuss how Qlik integrates with Amazon Redshift to simplify data sharing to ensure efficient collaboration and development of business insights that can drive decision-making.
Qlik Technologies is an AWS Data and Analytics Competency Partner with the Amazon Redshift specialization. Qlik helps enterprises move faster, work smarter, and lead the way forward with an end-to-end solution for getting value out of data.
Optimized for cloud and on-premises deployments, Qlik’s Active Intelligence Platform allows for open-ended, curiosity-driven exploration, giving everyone (at any skill level) the ability to make real discoveries that lead to outcomes and transformative changes.
Amazon Redshift Data Sharing
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL in addition to your existing extract, transform, and load (ETL), business intelligence (BI), and reporting tools.
It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization to power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
The Amazon Redshift data sharing capability allows customers to securely and easily share data across two or more Redshift clusters. Data sharing improves business agility by providing instant, granular, and high-performance access to live data without data copies or data movement, thereby providing live and transactionally consistent views of data across all consumers.
With Redshift, you can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls.
Redshift data sharing builds on RA3 managed storage, which decouples storage and compute and allows them to scale independently. The data sharing “producers” are Redshift clusters you want to share data from, and data sharing “consumers” are Redshift clusters that receive data share from producer clusters.
With data sharing, workloads accessing shared data are isolated from each other. Queries accessing shared data run on the consumer cluster and read from the Redshift managed storage layer directly without impacting the performance of the producer cluster.
Once data is shared, customers can use cross-database query functionality to query the shared databases from the consuming clusters and join the shared data with local data on the consumer cluster.
Why Qlik Customers Need Redshift Data Sharing
The workloads that Qlik users run on Amazon Redshift can be broadly divided into two categories:
- Data ingestion/ETL processing: Qlik connects to data sources such as databases, applications, files, web services, and event streams. It transforms according to user-specified rules, and loads into a customer’s Redshift clusters. The load itself consists of COPY statements, and also typically joins with target tables in order to perform UPDATEs and DELETEs of existing records.
- Analytical queries: These come in many forms such as ad-hoc analyses, internal or customer-facing dashboards, and scheduled reports. These workloads are read-heavy and consist mostly of SELECT queries.
We’ll refer to the first category as ETL workloads, and the second category as BI workloads.
BI workloads require data to be ingested, validated, and loaded into a data warehouse. BI users expect their data to be available and up-to-date, have a low tolerance for data delays, and also cannot miss performance SLAs for critical reporting needs.
In order for data to be up to date, ETL processes that ingest and model data need to be consistently running on time. There are two issues related to this that Amazon Redshift data sharing helps address.
Workload Contention
ETL workloads tend to be predictable in that they are executed on a schedule. Also, with users moving from nightly to continuous data ingestion and loading, ETL workloads need to execute at a regular frequency, sometimes even throughout the day.
BI workloads, on the other hand, tend to be dynamic and variable throughout the day as users perform ad-hoc analysis. When cluster capacity is shared between ETL and BI workloads, ETL workloads can reduce the resources available to BI workloads, adversely impacting query performance and worsening BI users’ experiences.
There may be different teams responsible for ETL processes and BI analysis, and each team may want to control the compute resources they need to provision for their workloads. Having separate clusters between ETL processes and BI workloads helps as each cluster can be owned by the ETL and BI team, respectively.
By using Amazon Redshift data sharing to isolate ETL and BI workloads to separate clusters, BI workloads are not affected by ETL workload resource demands, and BI users have up-to-date data and consistent SLAs for their analyses.
Another benefit of having separate clusters is that ETL and BI cluster capacity can be provisioned according to their workload-specific performance requirements, and can be changed independently as requirements change. Data sharing also enables the ability to offer charge-back capabilities for various workloads and teams.
Table Locking
While Amazon Redshift is designed to enable concurrent reads and writes DDL operations, ETL process might acquire an AccessExclusiveLock on the table in question, which means it cannot be accessed concurrently by ETL and BI workloads which can lead to contention.
Using Amazon Redshift data sharing to isolate BI workloads to a consumer cluster solves this problem. Data sharing provides snapshot isolation so consumers continue to query the data even when DDL is happening in producer cluster.
Also, note there are no data movements as a result of data sharing—live and transactional data is shared in-place through Amazon Redshift managed storage. This is illustrated in the following diagrams.
Figure 1 – Accessing data in place without Amazon Redshift data sharing.
Figure 2 – Sharing data in place through Amazon Redshift managed storage.
Walkthrough
This post covers the following steps:
- Configure Amazon Redshift data sharing using SQL statements:
- Set up the Redshift producer cluster (“qlik-rs-etl-producer”)
- Set up the Redshift consumer cluster (“qlik-rs-bi-consumer”)
- From Qlik Cloud, set up data sources on the Redshift consumer cluster, and verify the data share is accessible.
- Update the Redshift producer cluster and verify propagation of updates from the Redshift producer cluster to the Redshift consumer cluster.
Configuring Redshift Data Sharing Using SQL statements
In this section, we’ll look at configuring data sharing between two Redshift clusters. We have configured a process to ingest data into our Redshift producer cluster.
The data is stored in the “public” schema in the database “etl” and we want to create a datashare “bi_share” and share all the tables in the “public” schema in the database “etl.”
Figure 3 – Cluster namespaces.
Create a Datashare in the Redshift Producer Cluster
Run the following query on both the clusters and note down the cluster namespaces:
SELECT CURRENT_NAMESPACE;
Let’s assume the results are “<namespace-producer>” and “<namespace-consumer>”, respectively. In the producer cluster, create the datashare called ‘”bi_share” and grant consumer cluster access to it:
CREATE DATASHARE bi_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE bi_share ADD SCHEMA public;
ALTER DATASHARE bi_share ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE bi_share TO NAMESPACE ‘<namespace-consumer>’;
After the datashare has been created, validate what datashares exist on the cluster using the following commands:
SHOW DATASHARES;
Figure 4 – List available datashares.
The following command shows what objects datashare contains:
SELECT * FROM svv_datashare_objects;
Now, let’s say you don’t want to expose the “consumer” table to the consumer cluster. Remove it from the datashare by executing the following statement:
ALTER DATASHARE bi_share REMOVE TABLE public.consumer;
Map and Query Shared Data at the Redshift Consumer Cluster
At the consumer cluster, execute the following to create a new database “bi” from the “bi_share”:
CREATE DATASHARE bi FROM DATASHARE OF NAMESPACE ‘<namespace-producer>’;
Execute the following commands to test what tables you have access to and you don’t have access to:
SELECT * FROM bi.public.region LIMIT 5;
SELECT * FROM bi.public.customer LIMIT 5;
For the second query, you’ll get the error message that “Relation customer does not exist in the database.” This is because, previously, we removed the “customer” table from the datashare, as we don’t want to expose customer details to the consumer cluster.
Again, to find out what objects datashare contains, issue the following command:
SELECT * FROM svv_datashare_objects;
Figure 5 – List of all datashare objects.
As new data gets loaded to the producer cluster, it will be available in the consumer cluster instantaneously.
In the producer cluster, execute the following statement:
INSERT INTO public.region VALUES(5, ‘New_Region’, ‘Test Region’);
In the consumer cluster, execute the following statement:
SELECT * FROM public.region WHERE r_name = ‘New_Region’;
You will notice that as the “region” table in the producer cluster is updated with a new record, the same will be immediately available in the consumer cluster as well.
In addition to the [db].[schema].[table] notation, you can map a datashare schema to a local schema:
CREATE EXTERNAL SCHEMA bi_ext FROM REDSHIFT DATABASE ‘bi’ SCHEMA ‘public’;
SELECT * FROM bi_ext.orders LIMIT 5;
Figure 6 – Sample data from the EXTERNAL SCHEMA table.
Now, let’s move to Qlik Cloud and examine how we could access data in the consumer cluster to build an analytics application.
Set Up Data Sources on the Redshift Consumer Cluster
To test out Amazon Redshift data sharing in Qlik, login to your Qlik Cloud account.
Figure 7 – Qlik Cloud analytics services.
Click the “+ Add new” button at the top right corner and add a “New analytics app” and name it as “Testing Redshift Data Sharing App.”
Next, select the “Testing Redshift Data Sharing App” and click the “Data manager” option from the top left corner, and add data to your app using the “Files and other sources.”
In the pop-up window, select “Amazon Redshift” as the data source and create a new connection by giving the Host name, Port, Database name, and user credentials of the Amazon Redshift consumer cluster. Finally, “Save” the connection.
Figure 8 – Create a new connection.
Next, click the connection and select it so that we can see what schemas are available. Select the “bi_ext” schema in order to list the available tables in this schema.
Figure 9 – Examine tables in “bi_ext” schema.
Let’s select the “nation,” “region,” and “supplier” tables, and then click ”Next.” Qlik will load data and automatically find “Recommended associations” for the available tables. Preview the recommended associations and click “Apply all.”
Figure 10 – Qlik recommended associations.
Verify Propagation of Updates
Now, to check out the propagation of updates from the Redshift producer cluster to the Redshift consumer cluster, execute the following statement in the producer cluster:
INSERT INTO public.region VALUES(6, ‘New_Region2’, ‘Test Region2’);
Figure 11 – Examine the new record in region table.
As you refresh the Qlik page, you could see the new record shows up momentarily in the “region” table. Note the data is not ingested into the consumer cluster, but simply shared from the Amazon Redshift managed storage layer without any performance impact to the producer cluster.
Conclusion
Simplifying the mechanisms of data sharing across and between organizations is key for success in analytics journey.
In this post, you learned how Amazon Redshift’s data sharing gives you the ability to fan out and scale complex workloads, and how it helps achieve workload isolation and manageability for different organizational teams and needs.
You also learned about the architecture of Amazon Redshift with the data sharing feature enabled, how data sharing can be configured using SQL commands, and how easy it is to leverage the capability within Qlik Cloud for your business intelligence and analytical needs.
To learn more about Qlik, visit AWS Marketplace.
Special thanks to Bosco Albuquerque, Sr. Partner Solutions Architect – AWS
Qlik Technologies – AWS Partner Spotlight
Qlik Technologies is an AWS Competency Partner with the Amazon Redshift specialization. Qlik helps enterprises move faster, work smarter, and lead the way forward with an end-to-end solution for getting value out of data.