Connect Snowflake to S3 Tables using the SageMaker Lakehouse Iceberg REST endpoint
Organizations today seek data analytics solutions that provide maximum flexibility and accessibility. Customers need their data to be readily available to their preferred query engines and want to break down barriers across different computing environments. At the same time, they want a single copy of data used across these solutions to track lineage, control costs, and scale better. The rich ecosystem of tools available to query tabular data at scale has quickly made Apache Iceberg a popular choice among organizations.
Snowflake makes enterprise AI easy, connected and trusted. Thousands of companies around the globe, including hundreds of the world’s largest, use Snowflake’s AI Data Cloud to share data, build applications, and power their business with AI. In June 2024, Snowflake announced general availability for Iceberg tables, bringing the platform’s performance and simplicity to the open table format.
The recent developments from Snowflake and Amazon S3 Tables with built-in Iceberg support provide organizations with a powerful new approach to data lake management. In Snowflake, you can leverage the platform’s elastic and performant compute, use query semantics of typical Snowflake tables, and interact with a single copy of data in S3 Tables that is interoperable across computing environments. At the storage layer, S3 Tables introduce purpose-built optimizations for Iceberg that deliver improved performance, simplified security controls, and automatic table maintenance right out of the box. This approach gives organizations a more streamlined method to leverage open data standards, combining the strengths of Snowflake’s data processing capabilities with S3 Tables.
In this post, we walk through setting up Snowflake to access S3 Tables using the Amazon SageMaker Lakehouse Iceberg REST endpoint, with AWS Lake Formation managing storage credential vending. We start by using Amazon Athena to create the table and load data into it, and Lake Formation for access control. Next, we set up Snowflake to register an external catalog and query the table.
Note: At the time of publishing this blog post, the credential vending feature offered by Snowflake is in preview. Additionally, external catalogs are offered in read-only mode by Snowflake. The scope of these features is subject to change in the future.
Solution overview
This post will use the catalog integration for the SageMaker Lakehouse Iceberg REST endpoint with Signature Version 4 (SigV4) authentication in Snowflake.
Prerequisites
- A Snowflake account.
- An S3 table bucket with the AWS analytics integration enabled. For this post, we've named our bucket "s3tables-snowflake-integration".
- An AWS Identity and Access Management (IAM) role that is a Lake Formation data lake administrator in the Data Catalog account. For instructions, go to Create a data lake administrator.
Solution walkthrough
Part 1. Set up a namespace, create a table, and load data
Step 1: Log in to the Amazon S3 console and choose Table buckets from the navigation panel. Select the “s3tables-snowflake-integration” bucket.
Step 2: Select Create table with Athena.
Step 3: Create a namespace called “testnamespace”.
Step 4: Once the namespace is created, select Create table with Athena.
Step 5: Create a table “daily_sales” through Athena.
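The exact DDL from the console flow isn't reproduced here; the following is a minimal sketch of an Athena CREATE TABLE statement for an Iceberg table, with an assumed three-column sales schema:

```sql
-- Run in the Athena query editor, with the catalog set to
-- s3tablescatalog/s3tables-snowflake-integration and the database to testnamespace.
CREATE TABLE daily_sales (
  sale_date date,
  product_category string,
  sales_amount double
)
TBLPROPERTIES ('table_type' = 'iceberg');
```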
Step 6: Insert sample rows into the “daily_sales” table using Athena.
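For example, a few sample rows (the values are illustrative):

```sql
INSERT INTO daily_sales VALUES
  (DATE '2025-01-15', 'Laptop', 900.00),
  (DATE '2025-01-15', 'Monitor', 250.00),
  (DATE '2025-01-16', 'Laptop', 1350.00);
```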
Next, we walk through the configuration needed for Snowflake to access this table.
Part 2. Set up an IAM role for Snowflake to access S3 Tables through AWS Lake Formation
First, we create an IAM role that Snowflake will assume to access the AWS Glue and Lake Formation APIs. To do this, we create the following policy and role:
Step 1: Create the policy and name it "irc-glue-lf-policy". Follow these steps in the AWS Management Console:
1.1. Open the IAM console.
1.2. In the navigation pane of the console, choose Policies, and choose the Create Policy option.
1.3. In the policy editor, choose JSON and paste the following policy.
1.3.1. Replace <region>, <account-id>, <s3_table_bucket_name>, and <database_name> in the following policy with your values. We use "testnamespace" as the database name in the rest of this post.
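The policy JSON itself isn't reproduced here, so the following is a minimal sketch of what it typically contains: read-only AWS Glue catalog, database, and table actions scoped to the s3tablescatalog hierarchy, plus lakeformation:GetDataAccess for credential vending. Treat the exact resource ARN formats as assumptions and adjust them to your environment:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueIcebergRestReadAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetCatalog",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": [
        "arn:aws:glue:<region>:<account-id>:catalog",
        "arn:aws:glue:<region>:<account-id>:catalog/s3tablescatalog",
        "arn:aws:glue:<region>:<account-id>:catalog/s3tablescatalog/<s3_table_bucket_name>",
        "arn:aws:glue:<region>:<account-id>:database/s3tablescatalog/<s3_table_bucket_name>/<database_name>",
        "arn:aws:glue:<region>:<account-id>:table/s3tablescatalog/<s3_table_bucket_name>/<database_name>/*"
      ]
    },
    {
      "Sid": "LakeFormationCredentialVending",
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    }
  ]
}
```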
Step 2: Create a role named “snowflake_access_role” by following these steps in the IAM console.
2.1. In the navigation pane of the console, choose Roles and choose the Create role option.
2.2. Choose AWS account.
2.3. Choose Next and choose the policy you previously created in Step 1, named “irc-glue-lf-policy”.
2.4. Choose Next and enter “snowflake_access_role” as the role name.
2.5. Choose Create role.
2.6. The trust relationship for this role will be updated later. Once you have created the role, you need to define access to it using Lake Formation.
Note: If you use encryption for AWS Glue, you must modify the policy to add AWS Key Management Service (AWS KMS) permissions. For more information, see Setting up encryption in AWS Glue.
Part 3. Define access control using Lake Formation
Step 1: We start with the Application integration setup, which allows third-party engines to access S3 Tables. From the Lake Formation console, enable full table access for external engines to access data.
1.1. Sign in as a data lake administrator and go to the AWS Lake Formation console.
1.2. In the left navigation pane, expand the Administration section.
1.3. Choose Application integration settings and select Allow external engines to access data in Amazon S3 locations with full table access.
1.4. Choose Save.
Step 2: Next, we grant permissions to the snowflake_access_role on the catalog, database, and table resources, as described in the following steps.
2.1 In the Lake Formation console navigation pane, choose Data permissions, and then select Grant.
2.2. In the Principals section, select IAM users and roles, and then select snowflake_access_role from the drop-down list.
2.3. In the LF-Tags or catalog resources section, select Named Data Catalog resources:
2.3.1. Select <account-id>:s3tablescatalog/s3tables-snowflake-integration for Catalogs.
2.3.2. Select testnamespace for Databases.
2.3.3. Select daily_sales for Tables.
2.4. Select SUPER for Table permissions.
2.5. Choose Grant.
The permission configurations through Lake Formation and IAM are now complete. Next, we look at the Snowflake setup.
Part 4. Set up SageMaker Lakehouse Iceberg REST integration in Snowflake
Step 1: Log in to Snowflake as an admin user who has privileges to create a database and a catalog integration.
Step 2: Navigate to a worksheet and run the following commands to create a database and a catalog integration, providing these parameters:
- CATALOG_NAMESPACE: the S3 table namespace ("testnamespace").
- CATALOG_URI: the SageMaker Lakehouse Iceberg REST endpoint.
- WAREHOUSE: <account-id>:s3tablescatalog/<table-bucket-name>.
- SIGV4_IAM_ROLE: the IAM role created for Snowflake ("snowflake_access_role").
- ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS, so that the catalog integration uses the Lake Formation vended temporary credentials for data access.

Replace <account-id> and <region> with the values for your AWS environment.
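A sketch of these commands, following Snowflake's CREATE CATALOG INTEGRATION syntax for Iceberg REST catalogs; the database name is an assumption, and "glue_rest_catalog_int" matches the verification step that follows:

```sql
-- Database that will hold the mounted Iceberg table (name is an assumption).
CREATE DATABASE IF NOT EXISTS snowflake_s3tables_db;

CREATE OR REPLACE CATALOG INTEGRATION glue_rest_catalog_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'testnamespace'
  REST_CONFIG = (
    CATALOG_URI = 'https://glue.<region>.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    WAREHOUSE = '<account-id>:s3tablescatalog/s3tables-snowflake-integration'
    -- Use Lake Formation vended temporary credentials for data access.
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = 'arn:aws:iam::<account-id>:role/snowflake_access_role'
    SIGV4_SIGNING_REGION = '<region>'
  )
  ENABLED = TRUE;
```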
Step 3: Update the trust relationship of the IAM role you created for Snowflake access to table buckets ("snowflake_access_role"). First, retrieve the details that Snowflake generated for the catalog integration.
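In Snowflake, describing the catalog integration returns the API_AWS_IAM_USER_ARN and API_AWS_EXTERNAL_ID properties that the trust policy needs:

```sql
DESCRIBE CATALOG INTEGRATION glue_rest_catalog_int;
```

Using those two values, update the trust relationship of "snowflake_access_role" along these lines (a sketch; substitute the values returned by the command above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<API_AWS_IAM_USER_ARN>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<API_AWS_EXTERNAL_ID>" }
      }
    }
  ]
}
```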
Step 4: Verify the catalog integration using the command below.
```sql
SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('glue_rest_catalog_int');
```
Step 5: Run the following command to mount the S3 table as a Snowflake table.
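A sketch, reusing the catalog integration and database names from earlier; CATALOG_TABLE_NAME points at the S3 table created in Part 1:

```sql
USE DATABASE snowflake_s3tables_db;
USE SCHEMA public;

-- With vended credentials, no external volume is configured here;
-- Lake Formation supplies temporary storage credentials.
CREATE OR REPLACE ICEBERG TABLE s3tables_dailysales
  CATALOG = 'glue_rest_catalog_int'
  CATALOG_TABLE_NAME = 'daily_sales';
```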
Part 5. Access S3 Tables through Snowflake
Step 1: Log in to Snowflake as a user who has privileges to use the catalog integration you created.
Step 2: Run the following command to query the "s3tables_dailysales" table backed by the "s3tables-snowflake-integration" table bucket.
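For example:

```sql
SELECT * FROM s3tables_dailysales LIMIT 10;
```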
Cleaning up
To clean up your resources, complete the following steps:
- Delete the S3 table.
- Delete the namespace.
- Delete the S3 table bucket.
- You can also drop the Iceberg table, catalog integration, and database in Snowflake with the following commands:
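A sketch of that cleanup, assuming the names used earlier in this post:

```sql
DROP ICEBERG TABLE IF EXISTS s3tables_dailysales;
DROP CATALOG INTEGRATION IF EXISTS glue_rest_catalog_int;
DROP DATABASE IF EXISTS snowflake_s3tables_db;
```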
Conclusion
In this post, we walked through connecting your Snowflake environment to S3 Tables using the SageMaker Lakehouse Iceberg REST endpoint. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytics and other use cases such as data sharing and collaboration.