AWS Big Data Blog

Migrate Delta tables from Azure Data Lake Storage to Amazon S3 using AWS Glue

Organizations are increasingly using a multi-cloud strategy to run their production workloads. We often see requests from customers who started their data journey by building data lakes on Microsoft Azure and now want to extend access to that data to AWS services. Customers want to use a variety of AWS analytics, data, AI, and machine learning (ML) services like AWS Glue, Amazon Redshift, and Amazon SageMaker to build more cost-efficient and performant data solutions, harnessing the strengths of individual cloud service providers for their business use cases.

In such scenarios, data engineers face challenges in connecting to and extracting data from storage containers on Microsoft Azure. Customers typically use Azure Data Lake Storage Gen2 (ADLS Gen2) as their data lake storage medium, store the data in open table formats such as Delta Lake, and want to use AWS analytics services like AWS Glue to read those Delta tables. AWS Glue, with its ability to process data using Apache Spark and connect to various data sources, is a suitable solution for addressing the challenges of accessing data across multiple cloud environments.

AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue custom connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search for and subscribe to connectors from AWS Marketplace and begin your data preparation workflow in minutes.

In this post, we explain how you can extract data from ADLS Gen2 using the Azure Data Lake Storage Connector for AWS Glue. We specifically demonstrate how to import data stored in Delta tables in ADLS Gen2. We provide step-by-step guidance on how to configure the connector, author an AWS Glue ETL (extract, transform, and load) script, and load the extracted data into Amazon Simple Storage Service (Amazon S3).

Azure Data Lake Storage Connector for AWS Glue

The Azure Data Lake Storage Connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from ADLS Gen2. It uses Hadoop's FileSystem interface and the ADLS Gen2 connector for Hadoop. The connector includes the hadoop-azure module, which lets Apache Hadoop and Apache Spark jobs work directly with data in ADLS. When the connector is added to the AWS Glue environment, AWS Glue loads the connector library from its Amazon Elastic Container Registry (Amazon ECR) repository during job initialization. When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to ADLS.

Because the Azure Data Lake Storage Connector for AWS Glue is available in AWS Marketplace, an AWS Glue connection based on it makes sure the required packages are available to your AWS Glue job.

For this post, we use the Shared Key authentication method.
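
With the Shared Key method, the storage account key is supplied to Spark as a Hadoop configuration property, and data is addressed with abfss:// URIs. The following minimal sketch illustrates the pattern; the account, container, and path are placeholders, and the complete script later in this post retrieves these values from AWS Secrets Manager instead of hardcoding them:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Shared Key authentication: pass the storage account key to the hadoop-azure driver
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-key>"
)

# ADLS Gen2 paths use the abfss:// scheme: container@account host, then the object path
df = spark.read.format("delta").load(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path-to-delta-table-files/"
)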

Solution overview

In this post, our objective is to migrate a product table named sample_delta_table, which currently resides in ADLS Gen2, to Amazon S3. To accomplish this, we use AWS Glue, the Azure Data Lake Storage Connector for AWS Glue, and AWS Secrets Manager to securely store the Azure shared key. We use an AWS Glue serverless ETL job, configured with the connector, to establish a connection to ADLS using shared key authentication over the public internet. After the table is migrated to Amazon S3, we use Amazon Athena to query the Delta Lake table.

The following architecture diagram illustrates how AWS Glue facilitates data ingestion from ADLS.

Prerequisites

You need the following prerequisites:

Configure your ADLS Gen2 account in Secrets Manager

Complete the following steps to create a secret in Secrets Manager to store the ADLS credentials:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Add a key accountName with the ADLS Gen2 storage account name as its value.
  4. Add a key accountKey with the ADLS Gen2 storage account key as its value.
  5. Add a key container with the ADLS Gen2 container name as its value.
  6. Leave the rest of the options as default and choose Next.

  7. Enter a name for the secret (for example, adlstorage_credentials).
  8. Choose Next.
  9. Complete the rest of the steps to store the secret.
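
If you prefer to script this step, you can create the same secret with the AWS CLI. The following is a minimal sketch; the placeholder values stand in for your storage account name, account key, and container:
aws secretsmanager create-secret \
    --name adlstorage_credentials \
    --secret-string '{"accountName":"<storage-account>","accountKey":"<storage-account-key>","container":"<container>"}'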

Subscribe to the Azure Data Lake Storage Connector for AWS Glue

The Azure Data Lake Storage Connector for AWS Glue is available as an AWS Marketplace offering.

Complete the following steps to subscribe to the connector:

  1. Log in to your AWS account with the necessary permissions.
  2. Navigate to the AWS Marketplace page for the Azure Data Lake Storage Connector for AWS Glue.
  3. Choose Continue to Subscribe.
  4. Choose Continue to Configuration after reading the EULA.

  5. For Fulfillment option, choose Glue 4.0.
  6. For Software version, choose the latest software version.
  7. Choose Continue to Launch.

Create a custom connection in AWS Glue

After you’re subscribed to the connector, complete the following steps to create an AWS Glue connection based on it. This connection will be added to the AWS Glue job to make sure the connector is available and the data store connection information is accessible to establish a network pathway.

To create the AWS Glue connection, you need to activate the Azure Data Lake Storage Connector for AWS Glue on the AWS Glue Studio console. After you choose Continue to Launch in the previous steps, you’re redirected to the connector landing page.

  1. In the Configuration details section, choose Usage instructions.
  2. Choose Activate the Glue connector from AWS Glue Studio.

The AWS Glue Studio console gives you the option to either activate the connector only or activate it and create the connection in one step. For this post, we choose the second option.

  3. For Connector, confirm Azure ADLS Connector for AWS Glue 4.0 is selected.
  4. For Name, enter a name for the connection (for example, AzureADLSStorageGen2Connection).
  5. Enter an optional description.
  6. Choose Create connection and activate connector.

The connection is now ready for use. The connector and connection information is visible on the Data connections page of the AWS Glue console.
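
Optionally, you can confirm that the connection exists with the AWS CLI (a quick check; the name matches the one you entered above):
aws glue get-connection --name AzureADLSStorageGen2Connection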


Read Delta tables from ADLS Gen2 using the connector in an AWS Glue ETL job

Complete the following steps to create an AWS Glue job and configure the AWS Glue connection and job parameter options:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose Author code with a script editor and choose Script editor.
  3. Choose Create script and go to the Job details section.
  4. Update the settings for Name and IAM role.
  5. Under Advanced properties, add the AWS Glue connection AzureADLSStorageGen2Connection created in previous steps.
  6. For Job parameters, add the key --datalake-formats with the value delta.
  7. Use the following script to read the Delta table from ADLS. Provide the path to the Delta table files in your Azure storage account container and the S3 path where the output Delta files should be written.
from pyspark.sql import SparkSession
from delta.tables import *
import boto3
import json

spark = SparkSession.builder.getOrCreate()

sm = boto3.client('secretsmanager')
response = sm.get_secret_value(SecretId="adlstorage_credentials")
value = json.loads(response['SecretString'])
account_name_sparkconfig = f"fs.azure.account.key.{value['accountName']}.dfs.core.windows.net"
account_name = value['accountName']
account_key = value['accountKey']
container_name = value['container']
path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/path-to-delta-table-files/"
s3DeltaTablePath="s3://yourdatalakebucketname/deltatablepath/"

# Method: Shared Key  
spark.conf.set(account_name_sparkconfig, account_key)

# Read delta table from ADLS gen2 storage
df = spark.read.format("delta").load(path)

# Write delta table to S3 path.
if DeltaTable.isDeltaTable(spark,s3DeltaTablePath):
    s3deltaTable = DeltaTable.forPath(spark,s3DeltaTablePath)
    print("Merge to existing s3 delta table")
    (s3deltaTable.alias("target")
        .merge(df.alias("source"), "target.product_id = source.product_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    print("Create delta table to S3.")
    df.write.format("delta").save(s3DeltaTablePath)
  8. Choose Run to start the job.
  9. On the Runs tab, confirm the job ran successfully.
  10. On the Amazon S3 console, verify the Delta files in the S3 bucket (Delta table path).
  11. Create a database and table in Athena to query the migrated Delta table in Amazon S3.

You can accomplish this step using an AWS Glue crawler. The crawler can automatically crawl your Delta table stored in Amazon S3 and create the necessary metadata in the AWS Glue Data Catalog. Athena can then use this metadata to query and analyze the Delta table seamlessly. For more information, see Crawl Delta Lake tables using AWS Glue crawlers.
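
As a minimal sketch of the crawler approach (the crawler name and IAM role are placeholders, the target prefix matches the S3 path used in the script, and you should check the CreateCrawler API reference for the exact DeltaTargets fields), you could create and start a Delta Lake crawler with the AWS CLI:
aws glue create-crawler \
    --name delta-s3-crawler \
    --role <your_glue_crawler_role_arn> \
    --database-name deltadb \
    --targets '{"DeltaTargets":[{"DeltaTables":["s3://yourdatalakebucketname/deltatablepath/"],"WriteManifest":false,"CreateNativeDeltaTable":true}]}'
aws glue start-crawler --name delta-s3-crawler

Alternatively, you can register the table manually by running the following statements in the Athena query editor: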

CREATE DATABASE deltadb;
CREATE EXTERNAL TABLE deltadb.sample_delta_table
LOCATION 's3://yourdatalakebucketname/deltatablepath/'
TBLPROPERTIES ('table_type'='DELTA');

  12. Query the Delta table:

SELECT * FROM "deltadb"."sample_delta_table" limit 10;
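
If you want to script this validation step, you can submit the same query through the Athena API. The following AWS CLI sketch assumes a placeholder S3 location for Athena query results:
aws athena start-query-execution \
    --query-string 'SELECT * FROM "deltadb"."sample_delta_table" limit 10;' \
    --query-execution-context Database=deltadb \
    --result-configuration OutputLocation=s3://yourdatalakebucketname/athena-results/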

By following the steps outlined in the post, you have successfully migrated a Delta table from ADLS Gen2 to Amazon S3 using an AWS Glue ETL job.

Read the Delta table in an AWS Glue notebook

The following are optional steps if you want to read the Delta table from ADLS Gen2 in an AWS Glue notebook:

  1. Create a notebook and run the following code in the first notebook cell to configure the AWS Glue connection and --datalake-formats in an interactive session:
%idle_timeout 30
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%connections AzureADLSStorageGen2Connection
%%configure
{
   "--datalake-formats": "delta"
}

  2. Run the following code in a new cell to read the Delta table stored in ADLS Gen2. Provide the path to the Delta table files in your Azure storage account container and the S3 path where the Delta files should be written in Amazon S3.
from pyspark.sql import SparkSession
from delta.tables import *
import boto3
import json

spark = SparkSession.builder.getOrCreate()

sm = boto3.client('secretsmanager')
response = sm.get_secret_value(SecretId="adlstorage_credentials")
value = json.loads(response['SecretString'])
account_name_sparkconfig = f"fs.azure.account.key.{value['accountName']}.dfs.core.windows.net"
account_name = value['accountName']
account_key = value['accountKey']
container_name = value['container']
path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/path-to-delta-table-files/"
s3DeltaTablePath="s3://yourdatalakebucketname/deltatablepath/"

# Method: Shared Key  
spark.conf.set(account_name_sparkconfig, account_key)

# Read delta table from ADLS gen2 storage
df = spark.read.format("delta").load(path)

# Write delta table to S3 path.
if DeltaTable.isDeltaTable(spark,s3DeltaTablePath):
    s3deltaTable = DeltaTable.forPath(spark,s3DeltaTablePath)
    print("Merge to existing s3 delta table")
    (s3deltaTable.alias("target")
        .merge(df.alias("source"), "target.product_id = source.product_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    print("Create delta table to S3.")
    df.write.format("delta").save(s3DeltaTablePath)

Clean up

To clean up your resources, complete the following steps:

  1. Remove the AWS Glue job, database, table, and connection:
    1. On the AWS Glue console, choose Tables in the navigation pane, select sample_delta_table, and choose Delete.
    2. Choose Databases in the navigation pane, select deltadb, and choose Delete.
    3. Choose Connections in the navigation pane, select AzureADLSStorageGen2Connection, and on the Actions menu, choose Delete.
  2. On the Secrets Manager console, choose Secrets in the navigation pane, select adlstorage_credentials, and on the Actions menu, choose Delete secret.
  3. If you no longer need the connector, you can cancel the subscription:
    1. On the AWS Marketplace console, choose Manage subscriptions.
    2. Select the subscription for the product that you want to cancel, and on the Actions menu, choose Cancel subscription.
    3. Read the information provided and select the acknowledgement check box.
    4. Choose Yes, cancel subscription.
  4. On the Amazon S3 console, delete the data in the S3 bucket that you used in the previous steps. 

You can also use the AWS Command Line Interface (AWS CLI) to remove the AWS Glue and Secrets Manager resources. Remove the AWS Glue job, database, table, connection, and Secrets Manager secret with the following commands:

aws glue delete-job --job-name <your_job_name>
aws glue delete-connection --connection-name <your_connection_name>
aws secretsmanager delete-secret --secret-id <your_secretsmanager_id>
aws glue delete-table --database-name deltadb --name sample_delta_table
aws glue delete-database --name deltadb
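
You can also remove the migrated data in Amazon S3 with the AWS CLI. This deletes everything under the Delta table prefix used earlier, so double-check the path before running it:
aws s3 rm s3://yourdatalakebucketname/deltatablepath/ --recursive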

Conclusion

In this post, we demonstrated a real-world example of migrating a Delta table from Azure Data Lake Storage Gen2 to Amazon S3 using AWS Glue. We used an AWS Glue serverless ETL job, configured with an AWS Marketplace connector, to establish a connection to ADLS using shared key authentication over the public internet. Additionally, we used Secrets Manager to securely store the shared key and seamlessly integrate it within the AWS Glue ETL job, providing a secure and efficient migration process. Lastly, we provided guidance on querying the Delta Lake table from Athena.

Try out the solution for your own use case, and let us know your feedback and questions in the comments.


About the Authors

Nitin Kumar is a Cloud Engineer (ETL) at Amazon Web Services, specialized in AWS Glue. With a decade of experience, he excels in aiding customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.

Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru, specialized in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham loves to spend time with his family and travel around the world.

Pramod Kumar P is a Solutions Architect at Amazon Web Services. With 19 years of technology experience and close to a decade of designing and architecting connectivity solutions (IoT) on AWS, he guides customers to build solutions with the right architectural tenets to meet their business outcomes.

Madhavi Watve is a Senior Solutions Architect at Amazon Web Services, providing help and guidance to a broad range of customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience in software development and architecture and is a data analytics specialist.

Swathi S is a Technical Account Manager with the Enterprise Support team in Amazon Web Services. She has over 6 years of experience with AWS on big data technologies and specializes in analytics frameworks. She is passionate about helping AWS customers navigate the cloud space and enjoys assisting with design and optimization of analytics workloads on AWS.