AWS Big Data Blog

Modernize your legacy databases with AWS data lakes, Part 3: Build a data lake processing layer

This is the final part of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to process data with Amazon Redshift Spectrum and create the gold (consumption) layer. To review the first two parts of the series, where we load data from SQL Server into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS) and then load the data into the silver layer of the data lake, refer to Part 1 and Part 2.

Solution overview

Choosing the right tools and technology stack is critical to building a scalable data lake with a short time to market. In this post, we walk through the process of building the data lake, explain the rationale behind the different decisions, and share best practices for building such a data solution.

The following diagram illustrates the different layers of the data lake.

The data lake is designed to serve a multitude of use cases. In the silver layer of the data lake, the data is stored as it is loaded from sources, preserving the table and schema structure. In the gold layer, we create data marts by combining, aggregating, and enriching data as required by our use cases. The gold layer is the consumption layer for the data lake. In this post, we describe how you can use Redshift Spectrum as an API to query data.

To create data marts, we use Amazon Redshift Query Editor. It provides a web-based analyst workbench to create, explore, and share SQL queries. In our use case, we use Redshift Query Editor to create data marts using SQL code. We also use Redshift Spectrum, which allows you to efficiently query and retrieve structured and semi-structured data from files stored on Amazon S3 without having to load the data into the Redshift tables. The Apache Iceberg tables, which we created and cataloged in Part 2, can be queried using Redshift Spectrum. For the latest information on Redshift Spectrum integration with Iceberg, see Using Apache Iceberg tables with Amazon Redshift.

We also show how to use the Amazon Redshift Data API to run SQL commands that query the data mart using the AWS SDK for Python (Boto3). You can use the Redshift Data API to create the resulting datasets on Amazon S3, and then use the datasets in use cases such as business intelligence dashboards and machine learning (ML).

In this post, we walk through the following steps:

  1. Set up a Redshift cluster.
  2. Set up a data mart.
  3. Query the data mart.

Prerequisites

To follow the solution, you need to set up certain access rights and resources:

  • An AWS Identity and Access Management (IAM) role for the Redshift cluster with access to an external data catalog in AWS Glue and data files in Amazon S3 (these are the data files populated by the silver layer in Part 2). The role also needs Redshift cluster permissions. This policy must include permissions to do the following:
    • Run SQL commands to copy, unload, and query data with Amazon Redshift.
    • Run SELECT statements against data in related services, such as Amazon S3, Amazon CloudWatch Logs, Amazon SageMaker, and AWS Glue.
    • Manage AWS Lake Formation permissions (in case the AWS Glue Data Catalog is managed by Lake Formation).
  • An IAM execution role for AWS Lambda with permissions to access Amazon Redshift and AWS Secrets Manager.

For more information about setting up IAM roles for Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.

Set up a Redshift cluster

Redshift Spectrum is a feature of Amazon Redshift that queries data stored in Amazon S3 directly, without having to load it into Amazon Redshift. In our use case, we use Redshift Spectrum to query Iceberg data stored as Parquet files on Amazon S3. To use Redshift Spectrum, we first need a Redshift cluster to run the Redshift Spectrum compute jobs. Complete the following steps to provision a Redshift cluster:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose Create cluster.
  3. For Cluster identifier, enter a name for your cluster.
  4. For Choose the size of the cluster, select I’ll choose.
  5. For Node type, choose ra3.xlplus.
  6. For Number of nodes, enter 1.

  7. For Admin password, select Manage admin credentials in AWS Secrets Manager if you want to use Secrets Manager; otherwise, generate and store the credentials manually.

  8. For the IAM role, choose the IAM role created in the prerequisites.
  9. Choose Create cluster.

We chose the cluster Availability Zone, number of nodes, compute type, and size for this post to minimize costs. If you’re working on larger datasets, we recommend reviewing the different instance types offered by Amazon Redshift to select the one that is appropriate for your workloads.

Set up a data mart

A data mart is a collection of data organized around a specific business area or use case, providing focused and quickly accessible data for analysis or consumption by applications or users. Unlike a data warehouse, which serves the entire organization, a data mart is tailored to the specific needs of a particular department, allowing for more efficient and targeted data analysis. In our use case, we use data marts to create aggregated data from the silver layer and store it in the gold layer for consumption. For our use case, we use the schema HumanResources in the AdventureWorks sample database we loaded in Part 1. This database contains a factory’s employee shift information for different departments. We use this database to create a summary of the shift rate changes for different departments, years, and shifts to see which years had the most rate changes.

We recommend using the auto mount feature in Redshift Spectrum. This feature removes the need to create an external schema in Amazon Redshift to query tables cataloged in the Data Catalog.

Complete the following steps to create a data mart:

  1. On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
  2. Choose the cluster you created, and choose AWS Secrets Manager or Database username and password, depending on how you chose to store the credentials.
  3. After you’re connected, open a new query editor.

You will be able to see the AdventureWorks database under awsdatacatalog. You can now start querying the Iceberg database in the query editor.
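To verify the setup, you can run a quick query against one of the Iceberg tables through the auto mounted awsdatacatalog. The following is a minimal sketch; {glue_database_name} is a placeholder for the AWS Glue database you created in Part 2:

-- Preview an Iceberg table in the silver layer through the auto mounted Data Catalog
SELECT *
FROM "awsdatacatalog"."{glue_database_name}"."department"
LIMIT 10;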


If you encounter permission issues, choose the options menu (three dots) next to the cluster, choose Edit connection, and connect using Secrets Manager or your database user name and password. Then grant privileges for the IAM user or role with the following command, and reconnect with your IAM identity:

GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:{YOUR IAM ROLE NAME}"

For more information, see Querying the AWS Glue Data Catalog.

Next, you create a local schema to store the definition and data for the view.

  1. On the Create menu, choose Schema.
  2. Provide a name and set the type as local.
  3. For the data mart, create a dataset that combines different tables in the silver layer to generate a report of the total shift rate changes by department, year, and shift. The following SQL code will return the required dataset:
    SELECT dep.name AS "Department Name",
           extract(YEAR FROM emp_pay_hist.ratechangedate) AS "Rate Change Year",
           shift.name AS "Shift",
           COUNT(emp_pay_hist.rate) AS "Rate Changes"
    FROM "dev"."{redshift_schema_name}"."department" dep
    INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
    ON dep.departmentid = emp_hist.departmentid
    INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
    ON emp_pay_hist.businessentityid = emp_hist.businessentityid
    INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
    ON emp_hist.businessentityid = emp.businessentityid
    INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
    ON emp_hist.shiftid = shift.shiftid
    WHERE emp.currentflag = 'true'
    GROUP BY dep.name, extract(YEAR FROM emp_pay_hist.ratechangedate), shift.name;
    
  4. Create an internal schema where you want Amazon Redshift to store the view definition:
    CREATE SCHEMA IF NOT EXISTS {internal_schema_name};
  5. Create a view in Amazon Redshift that you can query to get the dataset:
    CREATE OR REPLACE VIEW {internal_schema_name}.rate_changes_by_department_year AS
    SELECT dep.name AS "Department Name",
            extract(YEAR FROM emp_pay_hist.ratechangedate) AS "Rate Change Year",
            shift.name AS "Shift",
            COUNT(emp_pay_hist.rate) AS "Rate Changes"
    FROM "dev"."{redshift_schema_name}"."department" dep
    
    INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
    ON dep.departmentid = emp_hist.departmentid
    
    INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
    ON emp_pay_hist.businessentityid = emp_hist.businessentityid
    
    INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
    ON emp_hist.businessentityid = emp.businessentityid
    
    INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
    ON emp_hist.shiftid = shift.shiftid
    
    WHERE emp.currentflag = 'true'
    
    GROUP BY dep.name, extract(YEAR FROM emp_pay_hist.ratechangedate), shift.name
    WITH NO SCHEMA BINDING;

If the SQL takes a long time to run or produces a large result set, consider using Amazon Redshift materialized views. Unlike regular views, which are computed at query time, the results from materialized views can be precomputed and stored on Amazon S3. When the data is requested, Amazon Redshift can point to the Amazon S3 location where the results are stored. Materialized views can be refreshed on demand or on a schedule.
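The following is a minimal sketch of this approach, assuming materialized views are supported for the external tables in your setup; the schema and table placeholders match the view definition shown earlier:

-- A materialized view that precomputes the same aggregation as the regular view
CREATE MATERIALIZED VIEW {internal_schema_name}.mv_rate_changes_by_department_year AS
SELECT dep.name AS "Department Name",
       extract(YEAR FROM emp_pay_hist.ratechangedate) AS "Rate Change Year",
       shift.name AS "Shift",
       COUNT(emp_pay_hist.rate) AS "Rate Changes"
FROM "dev"."{redshift_schema_name}"."department" dep
INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
ON dep.departmentid = emp_hist.departmentid
INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
ON emp_pay_hist.businessentityid = emp_hist.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
ON emp_hist.businessentityid = emp.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
ON emp_hist.shiftid = shift.shiftid
WHERE emp.currentflag = 'true'
GROUP BY dep.name, extract(YEAR FROM emp_pay_hist.ratechangedate), shift.name;

-- Refresh on demand, or schedule this statement to keep the precomputed results current
REFRESH MATERIALIZED VIEW {internal_schema_name}.mv_rate_changes_by_department_year;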

Query the data mart

Lastly, we query the data mart using a Lambda function to show how the data can be retrieved through an API. The Lambda function requires an IAM role to access Secrets Manager, where the Redshift user credentials are stored. We use the Redshift Data API to retrieve the dataset we created in the previous step. First, we call execute_statement() to run the query against the view. Next, we check the status of the run by calling describe_statement(). Finally, when the statement has successfully run, we use get_statement_result() to get the result set. The Lambda function shown in the following code implements this logic and returns the result set from querying the view rate_changes_by_department_year:

import json
import boto3
import time


def lambda_handler(event, context):

    client = boto3.client("redshift-data")

    # Use the Redshift Data API execute_statement call to query the data mart
    response = client.execute_statement(
        ClusterIdentifier=event["cluster_id"],
        Database=event["db"],
        SecretArn=event["secret_arn"],
        Sql=f"select * from {event['db']}.rate_changes_by_department_year",
        StatementName="query data mart",
    )

    statement_id = response["Id"]
    query_status = True
    resultSet = []

    # Poll the status of the SQL statement; once it has finished executing we can retrieve the result set
    while query_status:
        status = client.describe_statement(Id=statement_id)["Status"]

        if status == "FINISHED":

            print(
                "SQL statement has finished successfully and we can get the result set"
            )

            response = client.get_statement_result(Id=statement_id)
            columns = response["ColumnMetadata"]
            results = response["Records"]
            # The Data API paginates large result sets, so follow NextToken to fetch the remaining records
            while "NextToken" in response:
                response = client.get_statement_result(
                    Id=statement_id, NextToken=response["NextToken"]
                )
                results.extend(response["Records"])

            # Build a CSV-style result set: a header row of column labels followed by the data rows
            resultSet.append(
                str(columns[0].get("label"))
                + ","
                + str(columns[1].get("label"))
                + ","
                + str(columns[2].get("label"))
                + ","
                + str(columns[3].get("label"))
            )

            for result in results:
                resultSet.append(
                    str(result[0].get("stringValue"))
                    + ","
                    + str(result[1].get("longValue"))
                    + ","
                    + str(result[2].get("stringValue"))
                    + ","
                    + str(result[3].get("longValue"))
                )

            query_status = False

        # If the statement is aborted or fails, we stop the result set retrieval
        elif status in ("ABORTED", "FAILED"):
            query_status = False
            print("SQL statement has failed or was aborted")

        # To avoid spamming the API with requests on the status of the statement, we wait 2 seconds between calls
        else:
            print("Query Status :: " + status)
            time.sleep(2)

    return {"statusCode": 200, "body": resultSet}

The Redshift Data API allows you to access data from many different types of traditional, cloud-based, containerized, web service-based, and event-driven applications. The API is available in many programming languages and environments supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++. For larger datasets that don’t fit into memory, such as ML training datasets, you can use the Redshift UNLOAD command to move the results of the query to an Amazon S3 location.
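For example, the following UNLOAD statement is a minimal sketch of that pattern; the S3 path and IAM role ARN are placeholders for your own resources:

-- Write the data mart results to Amazon S3 as Parquet files instead of returning them to the client
UNLOAD ('SELECT * FROM {internal_schema_name}.rate_changes_by_department_year')
TO 's3://{your-gold-bucket}/rate_changes_by_department_year/'
IAM_ROLE '{your-redshift-role-arn}'
FORMAT AS PARQUET;

You can run this statement through the Redshift Data API in the same way as the SELECT in the Lambda function, and then point your BI tool or ML training job at the exported files.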

Clean up

In this post, you created an IAM role, Redshift cluster, and Lambda function. To clean up your resources, complete the following steps:

  1. Delete the IAM role:
    1. On the IAM console, choose Roles in the navigation pane.
    2. Select the role and choose Delete.
  2. Delete the Redshift cluster:
    1. On the Amazon Redshift console, choose Clusters in the navigation pane.
    2. Select the cluster you created and on the Actions menu, choose Delete.
  3. Delete the Lambda function:
    1. On the Lambda console, choose Functions in the navigation pane.
    2. Select the function you created and on the Actions menu, choose Delete.

Conclusion

In this post, we showed how you can use Redshift Spectrum to create data marts on top of the data in your data lake. Redshift Spectrum can query Iceberg data stored in Amazon S3 and cataloged in AWS Glue. You can create views in Amazon Redshift that compute the results from the underlying data on demand, or pre-compute results and store them (using materialized views). Lastly, the Redshift Data API is a great tool for running SQL queries on the data lake from a wide variety of sources.

For more insights into the Redshift Data API and how to use it, refer to Using the Amazon Redshift Data API to interact with Amazon Redshift clusters. To continue to learn more about building a modern data architecture, refer to Analytics on AWS.


About the Authors

Shaheer Mansoor is a Senior Machine Learning Engineer at AWS, where he specializes in developing cutting-edge machine learning platforms. His expertise lies in creating scalable infrastructure to support advanced AI solutions. His focus areas are MLOps, feature stores, data lakes, model hosting, and generative AI.

Anoop Kumar K M is a Data Architect at AWS with focus in the data and analytics area. He helps customers in building scalable data platforms and in their enterprise data strategy. His areas of interest are data platforms, data analytics, security, file systems and operating systems. Anoop loves to travel and enjoys reading books in the crime fiction and financial domains.

Sreenivas Nettem is a Lead Database Consultant at AWS Professional Services. He has experience working with Microsoft technologies with a specialization in SQL Server. He works closely with customers to help migrate and modernize their databases to AWS.