Microsoft Workloads on AWS
How to copy data from Azure Blob Storage to Amazon S3 using code
Our customers tell us they want a reliable way to copy data from Azure Blob Storage to Amazon Simple Storage Service (Amazon S3). Sometimes the data needs to be moved just once, as in a migration; other times, it needs to be copied continuously, as in a data pipeline. What’s common amongst these requests is that customers want an efficient, high-performance, and cost-effective solution.
In this blog post, we will show you how to build a secure, lightweight, and serverless application for copying data from Azure Blob Storage to Amazon S3. This solution provides a seamless and efficient way to transfer data and allows for programmatic integration with other applications. We’ll walk you through the design and share the code so you can deploy this in your own AWS account. The code is provided as both AWS CloudFormation and Terraform templates.
Solution overview
The design uses AWS Lambda to build a serverless solution. Communication between AWS Lambda functions and Azure uses the Azure SDK for Python. To support this communication, two libraries are required: one for handling identity and another for managing storage. These libraries are deployed as separate Lambda layers, allowing them to be easily reused in different solutions.
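To make this concrete, here is a minimal sketch of a handler that uses both layers: azure-identity for authentication and azure-storage-blob for storage access. The tenant, application, and container names are illustrative placeholders.

```python
# Minimal sketch: both imports below are satisfied by the two Lambda
# layers (azure-identity and azure-storage-blob).
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

def lambda_handler(event, context):
    # Authenticate as the Azure AD application (service principal);
    # all IDs below are placeholders.
    credential = ClientSecretCredential(
        tenant_id="<tenant-id>",
        client_id="<application-id>",
        client_secret="<client-secret>",
    )

    # Connect to the storage account and list the blobs in a container
    service = BlobServiceClient(
        account_url="https://<storage-account>.blob.core.windows.net",
        credential=credential,
    )
    container = service.get_container_client("<container-name>")
    return [blob.name for blob in container.list_blobs()]
```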
Each Lambda function has a specific job, which improves scalability, performance, and resilience. Lambda functions interact with one another using Amazon Simple Notification Service (Amazon SNS) for a publish-subscribe messaging pattern.
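On the receiving side, each subscribed function unpacks the SNS event it is handed. A minimal sketch, assuming a JSON message body with an illustrative blob_name field:

```python
import json

def lambda_handler(event, context):
    # SNS delivers one or more records per invocation; each Message is
    # the JSON string published by the upstream function.
    for record in event["Records"]:
        payload = json.loads(record["Sns"]["Message"])
        # "blob_name" is an illustrative field name, not the solution's
        # actual message schema.
        print(f"Received work item for blob: {payload['blob_name']}")
```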
An Amazon EventBridge rule triggers the Lambda functions on a daily schedule. The schedule can be modified to suit your specific requirements.
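The templates create this rule for you. If you later want to adjust the schedule outside of CloudFormation or Terraform, a sketch with boto3 (rule and function names are illustrative):

```python
import boto3

events = boto3.client("events")

# Create or update a rule that fires once a day; swap in a cron
# expression such as "cron(0 2 * * ? *)" for a fixed time of day.
events.put_rule(
    Name="azs3copy-daily",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)

# Point the rule at the first Lambda function in the pipeline. The
# function must also grant EventBridge invoke permission
# (lambda add-permission), which the templates handle for you.
events.put_targets(
    Rule="azs3copy-daily",
    Targets=[{
        "Id": "lambda01",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:azs3copy-lambda01",
    }],
)
```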
The sequence of events is illustrated in Figure 1, with arrows indicating the direction of invocation. Table 1 describes the order of events.
Figure 1: AWS services and order of events
| Step | Description |
| --- | --- |
| 1 | An EventBridge scheduled task triggers a Lambda function to start the data copy process. |
| 2 | Lambda functions query AWS Secrets Manager for Azure authentication credentials and setup parameters. |
| 3 | Lambda functions communicate with each other using SNS topics. |
| 4 | Lambda functions request an OAuth 2.0 token from Azure Active Directory over HTTPS. |
| 5 | Lambda functions interact with Azure Blob Storage over HTTPS. |
| 6 | Lambda functions upload data to an Amazon S3 bucket. |
Table 1: Order of events
Connections to Azure Blob Storage public endpoints use Transport Layer Security (TLS) 1.2 by default, which ensures data is encrypted in transit. As of January 5, 2023, all new objects uploaded to Amazon S3 are automatically encrypted with Amazon S3 managed keys (SSE-S3). In this solution, we go a step further and employ server-side encryption for the bucket using AWS Key Management Service (AWS KMS).
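Because the bucket has default encryption configured, uploads are encrypted without any extra request parameters. You can also request SSE-KMS explicitly per upload; a sketch (bucket name and key ID are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Explicitly request SSE-KMS for a single upload; the bucket's default
# encryption already does this for you in the deployed solution.
s3.put_object(
    Bucket="azs3copy-destination",
    Key="reports/data.csv",
    Body=b"example data",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="<kms-key-id-or-arn>",
)
```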
Access to Azure blobs from an application service principal is authorized using Azure Active Directory. The Azure application ID, Tenant ID and Client Secret are securely stored in AWS Secrets Manager and retrieved by Lambda functions as needed.
AWS Identity and Access Management (IAM) is a key component of any AWS solution. Each AWS service must only be allowed to do what it needs to do in the design. We have refined permissions to apply the principle of least privilege.
The solution can be controlled using two parameters. The first parameter, isactive, enables or disables the copy process. The second parameter, begindate, sets the minimum age of the objects to be copied. To simplify the design, these parameters are stored in AWS Secrets Manager alongside the Azure credentials, so Lambda functions can retrieve everything they need by interacting solely with AWS Secrets Manager, with a minimal impact on cost ($0.80 at the time of writing).
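A sketch of how a function can read both control parameters in a single call; the secret name and the stored value types are assumptions:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

def get_parameters(secret_id="<secret-name>"):
    # One call fetches the secret that holds the Azure credentials and
    # the copy-control parameters.
    value = secrets.get_secret_value(SecretId=secret_id)
    config = json.loads(value["SecretString"])

    is_active = config["isactive"]    # enables or disables the copy
    begin_date = config["begindate"]  # minimum age of blobs to copy
    return is_active, begin_date
```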
To increase observability, an Amazon CloudWatch dashboard combines metrics and logs into one view, enabling quick identification of issues.
The copying process is optimized based on the size of the blob being copied. For blobs smaller than 100 MB, Lambda functions 1-3 are used; for blobs exceeding 100 MB, Lambda functions 1-6 are employed, as illustrated in Figure 2.
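A sketch of the routing decision, assuming blob metadata has already been retrieved and using illustrative topic ARNs:

```python
import json
import boto3

SIZE_THRESHOLD = 100 * 1024 * 1024  # 100 MB

sns = boto3.client("sns")

def route_blob(blob_name, blob_size):
    # Small blobs go straight to the download function; large blobs go
    # to the multipart path. Both topic ARNs are placeholders.
    if blob_size < SIZE_THRESHOLD:
        topic = "arn:aws:sns:us-east-1:111122223333:azs3copy-download"
    else:
        topic = "arn:aws:sns:us-east-1:111122223333:azs3copy-largefile"
    sns.publish(
        TopicArn=topic,
        Message=json.dumps({"blob_name": blob_name, "size_bytes": blob_size}),
    )
```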
Figure 2: Amazon S3 Object routing
Figure 3 illustrates the logic implemented in the Lambda functions and Table 2 describes the purpose of each function.
Figure 3: Lambda function logic
| Lambda | Purpose |
| --- | --- |
| Lambda01 (launch-qualification) | Checks whether the copy process should run. |
| Lambda02 (find-blobs) | Finds Azure blobs that need to be copied. |
| Lambda03 (download) | Downloads Azure blobs that are smaller than 100 MB. |
| Lambda04 (largefile-initializer) | Started for Azure blobs larger than 100 MB; creates a manual multipart upload manifest in JSON. |
| Lambda05 (largefile-parter) | Downloads blob byte ranges and streams each range as a multipart upload part. |
| Lambda06 (largefile-recombinator) | Combines the uploaded parts into a single Amazon S3 object. |
Table 2: Lambda function purpose
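To make the large-file path concrete, here is a hedged sketch of the technique Lambda04-06 implement. In the deployed solution the work is split across separate functions and invocations; the sketch compresses it into one function for readability, and the helper signature and part size are assumptions.

```python
import boto3

s3 = boto3.client("s3")
PART_SIZE = 100 * 1024 * 1024  # 100 MB per part (assumption)

def copy_large_blob(blob_client, bucket, key, blob_size):
    # blob_client is an azure.storage.blob.BlobClient authenticated as
    # shown earlier; bucket/key identify the S3 destination.
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []

    # Download the blob in byte ranges and stream each range to S3 as
    # one part of a multipart upload.
    for i, offset in enumerate(range(0, blob_size, PART_SIZE), start=1):
        length = min(PART_SIZE, blob_size - offset)
        chunk = blob_client.download_blob(offset=offset, length=length).readall()
        part = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload["UploadId"],
            PartNumber=i, Body=chunk,
        )
        parts.append({"PartNumber": i, "ETag": part["ETag"]})

    # Stitch the parts together into a single S3 object
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```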
We’ve omitted some aspects of the design from the above diagram and table to keep the overview concise. For example, we added an Amazon SNS dead-letter queue to help with troubleshooting failures, which is not depicted in Figure 2.
In the next section, we will show you how to deploy the solution.
Prerequisites
- Access to an AWS Account with sufficient permissions to create the resources in Figure 1.
- An Azure Storage Account and blob storage container that will serve as the source for the data transfer.
- Azure Active Directory application and service principal. Make a note of the Application secret, which is only displayed during setup.
- The service principal must be assigned Storage Blob Data Contributor and Storage Queue Data Contributor roles scoped to the Storage Account.
- The Azure Active Directory application’s Application ID and Tenant ID.
Note: Copying files from Azure blob storage to Amazon S3 incurs Azure data egress charges.
Walkthrough
The code for this blog post is available in an AWS Samples Git repository. In the following walkthrough, we’ll focus on an AWS CloudFormation deployment. This exercise will take about 20 minutes to complete.
To deploy the AWS CloudFormation Stack:
- Download the source code and extract locally. The archive contains a set of folders and files. You don’t need to extract individual zip files within the folders.
- Upload files from the AzureblobtoAmazonS3copy/CFN folder to an Amazon S3 bucket of your choice. This will be your artifact repository. The folder contains the following files:
  - azure-arm-identity.zip – Lambda layer for Azure identity
  - azure-arm-storage.zip – Lambda layer for Azure storage
  - azs3copy-lambda01.zip to azs3copy-lambda06.zip – Azure blob copy Lambda functions
  - azs3copy-stack.yaml – AWS CloudFormation template
- Once the files have been uploaded, copy the object URL for azs3copy-stack.yaml, as shown in Figure 4:
Figure 4: Retrieving Amazon S3 Object URL location
- Follow this documentation to create a new AWS CloudFormation stack. When choosing a stack template, use the Amazon S3 object URL you copied in the previous step.
- Configure AWS CloudFormation deployment parameters. Parameter descriptions and examples are included in the CloudFormation template. Don’t change the Advanced Settings – these are currently experimental.
- The deployment will take 5-10 minutes. When finished, a CREATE_COMPLETE message is displayed as depicted in Figure 5.
Figure 5: AWS CloudFormation stack complete message
The solution is now set up and ready. It will run automatically based on the schedule you specified in the parameters. If you wish to initiate a copy immediately, proceed to the next section.
Initiate a manual copy of data
- Browse to the AWS Secrets Manager console and choose the secret created by the deployment. In Figure 6 you can see we used “etl” as the prefix code.
Figure 6: Locating secret in AWS Secrets Manager
- Choose Retrieve secret value as shown in Figure 7.
Figure 7: Location retrieve secret value button
- Notice the key for isactive. Changing the value to False will disable the copy process. Review the key for begindate, as illustrated in Figure 8. This is the minimum age of the objects to be copied; make sure it suits your needs. If you prefer to change these values programmatically, see the sketch after Figure 8.
Figure 8: AWS Secret Manager secret configuration
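If you prefer to flip these values from code rather than the console, a minimal sketch; the secret name and stored value format are assumptions:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

def set_copy_state(secret_id, active):
    # Read the current secret, update the isactive flag, and write the
    # secret back as a new version.
    value = secrets.get_secret_value(SecretId=secret_id)
    config = json.loads(value["SecretString"])
    config["isactive"] = "True" if active else "False"
    secrets.put_secret_value(SecretId=secret_id, SecretString=json.dumps(config))

set_copy_state("<secret-name>", active=True)
```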
- Browse to the AWS Lambda console.
- Search for the Lambda Functions created by the deployment. In Figure 9, you can see we used “etl” as the prefix code.
Figure 9: Selecting solutions Lambda Functions
- Choose the function with lambda01 in the name as shown in Figure 10. The Python code will be visible in the console.
Figure 10: Selecting lambda01
- Select the Test tab and choose Test as illustrated in Figure 11.
Figure 11: Viewing Lambda Function Python code
- You should receive an Execution result: succeeded message as depicted in Figure 12.
Figure 12: Lambda Function execution succeeded
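As an alternative to the console’s Test button, you can invoke the function with boto3; the function name below is illustrative, so use the name your deployment created:

```python
import boto3

lam = boto3.client("lambda")

response = lam.invoke(
    FunctionName="etl-azs3copy-lambda01",  # placeholder name
    InvocationType="RequestResponse",      # wait for the result
    Payload=b"{}",
)
print(response["StatusCode"])  # 200 indicates a successful invocation
```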
- Navigate to the Amazon S3 bucket created by the deployment. The bucket name will start with the AWS CloudFormation stack name as illustrated in Figure 13. Files will start to appear in the directory. Refresh the page to check progress.
Figure 13: Locating copied objects in Amazon S3
Troubleshooting
If there are no files in the destination bucket, you can check the Amazon CloudWatch dashboard for errors.
- Browse to the Amazon CloudWatch console.
- Select Dashboards on the left and choose the dashboard created by the deployment as depicted in Figure 14. In this example, we used “etl” as the prefix code.
Figure 14: Locating Amazon CloudWatch dashboard
- A dashboard is displayed as shown in Figure 15. Review the data in the widgets to identify any potential errors, such as incorrect Azure application secrets or insufficient permissions to the Azure Blob Storage account.
Figure 15: A view of the Amazon CloudWatch dashboard
In the example above, we copied 76 files, totaling 122.11 GB, in 7.32 seconds.
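You can also search a function’s logs from code with CloudWatch Logs Insights. A sketch, assuming an illustrative log group name:

```python
import time
import boto3

logs = boto3.client("logs")

# Search the last hour of a function's logs for errors; the log group
# name is a placeholder.
query = logs.start_query(
    logGroupName="/aws/lambda/etl-azs3copy-lambda01",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/",
)

# Poll until the query finishes, then print any matching log lines
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] == "Complete":
        break
    time.sleep(1)
for row in results["results"]:
    print(row)
```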
Cleanup
Before deleting the AWS CloudFormation stack, you will need to empty the Amazon S3 bucket.
- Browse to Amazon S3 in the AWS Management Console.
- Choose the Amazon S3 bucket created during the deployment and check its contents. You may wish to keep the data for future use. If you wish to retain the files, move the folder to another Amazon S3 bucket.
- When you’re ready, select the Amazon S3 bucket and choose Empty as illustrated in Figure 16.
Figure 16: Emptying Amazon S3 bucket
- Follow the on-screen instructions to confirm the action.
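Alternatively, you can empty the bucket programmatically; a minimal sketch with a placeholder bucket name:

```python
import boto3

# Delete all objects, including any object versions, before deleting
# the stack; the bucket name is a placeholder.
bucket = boto3.resource("s3").Bucket("<destination-bucket-name>")
bucket.objects.all().delete()
bucket.object_versions.all().delete()  # no-op if versioning is off
```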
- Finally, follow this documentation to delete the AWS CloudFormation stack.
Conclusion
In this blog post, we showed you how you can use AWS services to build a serverless application that copies data from Azure Blob Storage to Amazon S3 in an efficient and cost-effective manner.
AWS has significantly more services, and more features within those services, than any other cloud provider, making it faster, easier, and more cost effective to move your existing applications to the cloud and build nearly anything you can imagine. Give your Microsoft applications the infrastructure they need to drive the business outcomes you want. Visit our .NET on AWS and AWS Database blogs for additional guidance and options for your Microsoft workloads.
Contact us to start your migration and modernization journey today.