Introducing AWS Glue crawlers using AWS Lake Formation permission management

Data lakes provide a centralized repository that consolidates your data at scale and makes it available for different kinds of analytics. AWS Glue crawlers are a popular way to scan data in a data lake, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. AWS Lake Formation enables you to centrally govern, secure, and share your data, and lets you scale permissions easily.

We are pleased to announce the integration of AWS Glue crawlers with Lake Formation. You can now use Lake Formation permissions for the crawler’s access to your Lake Formation managed data lakes, whether those are in your account or in other accounts. Before this release, you had to set up the AWS Glue crawler IAM role with Amazon Simple Storage Service (Amazon S3) permissions to crawl a data source on Amazon S3, and also establish S3 bucket policies on the source bucket so that the crawler role could access the S3 data source. Now you can use the Lake Formation permissions defined on the data lake for crawling the data, and you no longer need to configure dedicated Amazon S3 permissions for crawlers. Lake Formation manages the crawler IAM role’s access to S3 buckets and prefixes using data location permissions, which simplifies security management. Furthermore, you can apply the same security model to crawlers that you use for AWS Glue jobs and Amazon Athena, for centralized governance.

When you configure an AWS Glue crawler to use Lake Formation, by default, the crawler uses Lake Formation in the same account to obtain data access credentials. However, you can also configure the crawler to use Lake Formation in a different account by providing an account ID during creation. The cross-account capability allows you to perform permissions management from a central governance account. Customers prefer the central governance experience over writing bucket policies separately in each bucket-owning account. To build a data mesh architecture, you can author permissions in a single Lake Formation governance account to manage access to data locations and crawlers spanning multiple accounts in your data lake. For more information, refer to How to configure a crawler to use Lake Formation credentials.
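In API terms, this setting corresponds to the LakeFormationConfiguration parameter of the AWS Glue CreateCrawler and UpdateCrawler operations. The following is a minimal sketch, assuming you use boto3 (the AWS SDK for Python); the helper names and the crawler name in the usage comment are hypothetical and not part of the original walkthrough.

```python
from typing import Optional


def lake_formation_configuration(account_id: Optional[str] = None) -> dict:
    """Build the LakeFormationConfiguration value for a Glue crawler.

    Without account_id, the crawler uses Lake Formation in its own account
    (the default). With account_id, the crawler obtains credentials from
    Lake Formation in that account, for example a central governance account.
    """
    config = {"UseLakeFormationCredentials": True}
    if account_id is not None:
        if not (account_id.isdigit() and len(account_id) == 12):
            raise ValueError("AWS account IDs are 12-digit strings")
        config["AccountId"] = account_id
    return config


def enable_lake_formation_credentials(glue_client, crawler_name: str,
                                      account_id: Optional[str] = None) -> None:
    """Switch an existing crawler to Lake Formation credentials."""
    glue_client.update_crawler(
        Name=crawler_name,
        LakeFormationConfiguration=lake_formation_configuration(account_id),
    )


# Usage (needs boto3 and AWS credentials; the crawler name is a placeholder):
#   import boto3
#   enable_lake_formation_credentials(boto3.client("glue"), "my-crawler")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you point it at a real account.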

In this post, we walk through a single in-account architecture that shows how to enable Lake Formation permissions on the data lake, configure an AWS Glue crawler with Lake Formation permission to scan and populate schema from an S3 data lake into the AWS Glue Data Catalog, and then use an analytical engine like Amazon Athena to query the data.

Solution overview

The AWS Glue crawler and Lake Formation integration supports in-account crawling as well as cross-account crawling. You can configure a crawler to use Lake Formation permissions to access an S3 data store or a Data Catalog table with an underlying S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler’s target if the crawler and the Data Catalog table reside in the same account. The following figure shows the in-account crawling architecture.

Prerequisites

Complete the following prerequisite steps:

Sign in to the Lake Formation console as admin.
If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
In the navigation pane, under Data catalog, choose Settings.
Deselect Use only IAM access control for new databases.
Deselect Use only IAM access control for new tables in new databases.
Keep Version 3 as the current cross-account version.
Choose Save.
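If you would rather script these settings than use the console, the following sketch shows one possible approach with the Lake Formation GetDataLakeSettings and PutDataLakeSettings operations. It assumes boto3 and a caller who is already a data lake administrator; the function names are hypothetical. Clearing the two default-permission lists is the API counterpart of deselecting the two check boxes; the cross-account version is left unchanged.

```python
import copy


def disable_iam_only_defaults(settings: dict) -> dict:
    """Return a copy of DataLakeSettings with the IAM-only defaults cleared.

    Empty default-permission lists mean new databases and tables are no
    longer opened up to IAM_ALLOWED_PRINCIPALS automatically.
    """
    updated = copy.deepcopy(settings)
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    return updated


def apply_prerequisite_settings(lakeformation_client) -> dict:
    """Read the current settings, clear the defaults, and write them back.

    PutDataLakeSettings replaces the whole settings object, so the current
    values (administrators, parameters, and so on) are read first and kept.
    """
    current = lakeformation_client.get_data_lake_settings()["DataLakeSettings"]
    updated = disable_iam_only_defaults(current)
    lakeformation_client.put_data_lake_settings(DataLakeSettings=updated)
    return updated


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   apply_prerequisite_settings(boto3.client("lakeformation"))
```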

Set up your solution resources

We set up the solution resources using AWS CloudFormation. Complete the following steps:

Log in to the AWS Management Console as an IAM administrator.
Choose Launch Stack to deploy a CloudFormation template:
For LFBusinessAnalystUserName, keep the default, LFBusinessAnalyst.
Create your stack.
When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
Note the values of Databasename, DataLakeBucket, and GlueCrawlerName.
Choose the LFBusinessAnalystUserCredentials value to navigate to the AWS Secrets Manager console.
In the Secret value section, choose Retrieve secret value.
Note down the secret value for the password for IAM user LFBusinessAnalyst.
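The same values can also be read programmatically. The sketch below assumes boto3, and it assumes that the names above are logical resource IDs as listed on the Resources tab; if your stack exposes them as outputs instead, you would read them from DescribeStacks. The function names and the stack name in the usage comment are hypothetical.

```python
def stack_physical_ids(cloudformation_client, stack_name: str, logical_ids) -> dict:
    """Map selected logical resource IDs of a stack to their physical IDs."""
    response = cloudformation_client.describe_stack_resources(StackName=stack_name)
    wanted = set(logical_ids)
    return {
        resource["LogicalResourceId"]: resource.get("PhysicalResourceId")
        for resource in response["StackResources"]
        if resource["LogicalResourceId"] in wanted
    }


def read_secret(secretsmanager_client, secret_id: str) -> str:
    """Return the secret string stored under a secret name or ARN.

    Whether the string is the bare password or a JSON document depends on
    how the template created the secret, so inspect it before parsing.
    """
    response = secretsmanager_client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]


# Usage (needs boto3 and AWS credentials; the stack name is a placeholder):
#   import boto3
#   ids = stack_physical_ids(
#       boto3.client("cloudformation"), "my-stack",
#       ["Databasename", "DataLakeBucket", "GlueCrawlerName",
#        "LFBusinessAnalystUserCredentials"],
#   )
#   password = read_secret(boto3.client("secretsmanager"),
#                          ids["LFBusinessAnalystUserCredentials"])
```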

Validate resources

In your account, validate the following resources created by the template:

AWS Glue database – The Databasename value noted from the CloudFormation template.
S3 bucket for the data lake with sample data – The DataLakeBucket value noted from the CloudFormation template.
AWS Glue crawler and IAM role with required permission – The GlueCrawlerName value noted from the CloudFormation template.

The template registers the S3 bucket with Lake Formation as the data location. To verify this, on the Lake Formation console, choose Data lake locations under Register and ingest in the navigation pane.

The template also grants data location permission on the S3 bucket to the crawler role. To verify this, choose Data locations under Permissions in the navigation pane.

Lastly, the template grants database permission to the crawler role. To verify this, choose Data lake permissions under Permissions in the navigation pane.
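These checks can also be scripted. The following is a minimal sketch, assuming boto3 and the Lake Formation ListResources and ListPermissions operations; the function names, bucket, role, and database in the usage comment are placeholders. For brevity, it reads only the first page of each response.

```python
def is_location_registered(lakeformation_client, location_arn: str) -> bool:
    """Check whether an S3 ARN is registered as a data lake location."""
    resources = lakeformation_client.list_resources()["ResourceInfoList"]
    return any(r["ResourceArn"] == location_arn for r in resources)


def granted_permissions(lakeformation_client, principal_arn: str,
                        resource: dict) -> set:
    """Collect the permissions a principal holds on one resource."""
    response = lakeformation_client.list_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource=resource,
    )
    granted = set()
    for entry in response["PrincipalResourcePermissions"]:
        granted.update(entry["Permissions"])
    return granted


# Usage (needs boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   lf = boto3.client("lakeformation")
#   bucket_arn = "arn:aws:s3:::<DataLakeBucket>"
#   role_arn = "arn:aws:iam::<your-account-id>:role/<crawler-role>"
#   print(is_location_registered(lf, bucket_arn))
#   print(granted_permissions(lf, role_arn,
#                             {"DataLocation": {"ResourceArn": bucket_arn}}))
#   print(granted_permissions(lf, role_arn,
#                             {"Database": {"Name": "<Databasename>"}}))
```

For the data location, you would expect DATA_LOCATION_ACCESS in the returned set; the exact database permissions depend on what the template grants.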

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Locate the crawler lfcrawler-<your-account-id> and edit it.
Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
Choose Next.
Review and update the crawler settings.

Note that the crawler IAM role uses Lake Formation permissions to access the data and doesn’t have any S3 policies.

Run the crawler and verify that the crawler run is complete.
In the AWS Glue database lfcrawlerdb<your-account-id>, verify that the table is created and that its schema matches the data in the S3 bucket.

The crawler was able to crawl the S3 data source and successfully populate the schema using Lake Formation permissions.
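To run the crawler and inspect the resulting schema without the console, something like the following could work. It is a sketch that assumes boto3 and the AWS Glue StartCrawler, GetCrawler, and GetTables operations; the function names and polling intervals are arbitrary choices, and only the first page of tables is read.

```python
import time


def run_crawler_and_wait(glue_client, crawler_name: str,
                         poll_seconds: float = 15.0,
                         timeout_seconds: float = 1800.0) -> str:
    """Start a crawler, wait until it is READY again, return the last status."""
    glue_client.start_crawler(Name=crawler_name)
    deadline = time.monotonic() + timeout_seconds
    while True:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl.Status is SUCCEEDED, CANCELLED, or FAILED.
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawler {crawler_name} did not finish in time")
        time.sleep(poll_seconds)


def table_schemas(glue_client, database_name: str) -> dict:
    """Map each table name in a database to its (column, type) pairs."""
    tables = glue_client.get_tables(DatabaseName=database_name)["TableList"]
    return {
        table["Name"]: [
            (column["Name"], column["Type"])
            for column in table["StorageDescriptor"]["Columns"]
        ]
        for table in tables
    }


# Usage (needs boto3 and AWS credentials; substitute your account ID):
#   import boto3
#   glue = boto3.client("glue")
#   print(run_crawler_and_wait(glue, "lfcrawler-<your-account-id>"))
#   print(table_schemas(glue, "lfcrawlerdb<your-account-id>"))
```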

Grant access to the data analyst using Lake Formation

Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalyst user via the Lake Formation console.

Grant the LFBusinessAnalyst IAM user access to the database with Describe permissions.

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM user LFBusinessAnalyst.
Under LF-Tags or catalog resources, choose lfcrawlerdb<your-account-id> for Databases.
Select Describe for Database permissions.
Choose Grant to apply the permissions.

Grant the LFBusinessAnalyst IAM user Select and Describe access to the table.

On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM user LFBusinessAnalyst.
Under LF-Tags or catalog resources, choose lfcrawlerdb<your-account-id> for Databases and lf_datalake_<your-account-id>_<region> for Tables.
Select Select and Describe for Table permissions.
Choose Grant to apply the permissions.
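The two grants above can be expressed as Lake Formation GrantPermissions requests. The following is a minimal sketch, assuming boto3 and a caller who is a data lake administrator; the function names are hypothetical, and the ARN and resource names in the usage comment are placeholders to fill in with your own values.

```python
def database_grant(principal_arn: str, database_name: str) -> dict:
    """Request body granting DESCRIBE on a database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": ["DESCRIBE"],
    }


def table_grant(principal_arn: str, database_name: str, table_name: str) -> dict:
    """Request body granting SELECT and DESCRIBE on a table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database_name, "Name": table_name}},
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def grant_analyst_access(lakeformation_client, principal_arn: str,
                         database_name: str, table_name: str) -> None:
    """Apply both grants, database first and then table."""
    requests = (
        database_grant(principal_arn, database_name),
        table_grant(principal_arn, database_name, table_name),
    )
    for request in requests:
        lakeformation_client.grant_permissions(**request)


# Usage (needs boto3 and AWS credentials; substitute your own values):
#   import boto3
#   grant_analyst_access(
#       boto3.client("lakeformation"),
#       "arn:aws:iam::<your-account-id>:user/LFBusinessAnalyst",
#       "lfcrawlerdb<your-account-id>",
#       "lf_datalake_<your-account-id>_<region>",
#   )
```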

Verify the tables using Athena

To verify the tables using Athena, complete the following steps:

Log in to the console as the IAM user LFBusinessAnalyst, using the password you retrieved earlier from the secret created by the CloudFormation stack.
On the Athena console, choose lfconsumer-primary-workgroup as the Athena workgroup.
Run the query to validate access as shown in the following screenshot.

We have successfully crawled the Amazon S3 data store using the crawler with Lake Formation permissions and populated the metadata in the AWS Glue Data Catalog. We have also granted Lake Formation permissions on the database and table to the consumer user and validated the user’s access to the data using Athena.
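The Athena validation can also be run through the API. The sketch below assumes boto3, credentials for the LFBusinessAnalyst user, and a workgroup that already has a query result location configured. The function name is hypothetical, and the SELECT statement in the usage comment is an assumed example, not necessarily the exact query from the walkthrough.

```python
import time


def run_athena_query(athena_client, sql: str, database: str, workgroup: str,
                     poll_seconds: float = 2.0) -> list:
    """Run a query, wait for it to finish, return the first page of rows.

    Each row is a list of string values (None for SQL NULL). For SELECT
    statements, Athena returns the column names as the first row.
    """
    execution_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=execution_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"query {status['State'].lower()}: {reason}")

    results = athena_client.get_query_results(QueryExecutionId=execution_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Usage (needs boto3 and AWS credentials; substitute your own values):
#   import boto3
#   rows = run_athena_query(
#       boto3.client("athena"),
#       'SELECT * FROM "lf_datalake_<your-account-id>_<region>" LIMIT 10',
#       "lfcrawlerdb<your-account-id>",
#       "lfconsumer-primary-workgroup",
#   )
```

If the Lake Formation grants are missing, the query is expected to fail with an authorization error rather than return rows, which makes this a quick access check.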

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
Delete the stack you created.
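If you prefer to clean up programmatically, the following sketch deletes the stack and waits for the deletion to finish. It assumes boto3 and the CloudFormation DeleteStack operation with the stack_delete_complete waiter; the function name and the stack name in the usage comment are placeholders.

```python
def delete_stack_and_wait(cloudformation_client, stack_name: str) -> None:
    """Delete a CloudFormation stack and block until deletion completes."""
    cloudformation_client.delete_stack(StackName=stack_name)
    waiter = cloudformation_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)


# Usage (needs boto3 and AWS credentials; the stack name is a placeholder):
#   import boto3
#   delete_stack_and_wait(boto3.client("cloudformation"), "my-stack")
```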

Summary

In this post, we showed how to use the new AWS Glue crawler integration with Lake Formation. Data lake admins can now share crawled tables with data analysts using Lake Formation, allowing analysts to use analytical services such as Athena. You can centrally manage all permissions in Lake Formation, making it easier to administer and protect data lakes.

Special thanks to everyone who contributed to this crawler feature launch: Anshuman Sharma, Jessica Cheng, Aditya K, and Sandya Krishnanand.

If you have questions or suggestions, submit them in the comments section.

About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

AWS Big Data Blog