AWS Big Data Blog
Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation
AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or those that deal with sensitive data.
Lake Formation makes it straightforward to build, secure, and manage data lakes. It allows you to define fine-grained access controls through grant and revoke statements, similar to those used with relational database management systems (RDBMS), and automatically enforce those policies using compatible engines like Amazon Athena, Apache Spark on Amazon EMR, and Amazon Redshift Spectrum. With AWS Glue 5.0, the same Lake Formation rules that you set up for use with other services like Athena now apply to your AWS Glue Spark jobs and Interactive Sessions through built-in Spark SQL and Spark DataFrames. This simplifies security and governance of your data lakes.
This post demonstrates how to enforce FGAC on AWS Glue 5.0 through Lake Formation permissions.
How FGAC works on AWS Glue 5.0
When you use AWS Glue 5.0 with Lake Formation, AWS Glue enforces a layer of permissions on each Spark job so that Lake Formation access controls are applied while the job runs. AWS Glue uses Spark resource profiles to create two profiles for running the job: the user profile runs user-supplied code, and the system profile enforces Lake Formation policies. For more information, see the AWS Lake Formation Developer Guide.
The following diagram provides a high-level overview of how AWS Glue 5.0 gets access to data protected by Lake Formation permissions.
The workflow consists of the following steps:
- A user calls the StartJobRun API on a Lake Formation enabled AWS Glue job.
- AWS Glue sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that has no ability to launch tasks, request executors, or access Amazon Simple Storage Service (Amazon S3) or the AWS Glue Data Catalog. It builds a job plan.
- AWS Glue sets up a second driver called the system driver and runs it in the system profile (with a privileged identity). AWS Glue sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plans to the system driver. The system driver doesn’t run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.
- AWS Glue then runs the stages on executors with the user driver or system driver. The user code in any stage is run exclusively on user profile executors.
- Stages that read data from Data Catalog tables protected by Lake Formation or those that apply security filters are delegated to system executors.
Enable FGAC on AWS Glue 5.0
To enable Lake Formation FGAC for your AWS Glue 5.0 jobs on the AWS Glue console, complete the following steps:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose your job.
- Choose the Job details tab.
- For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
- For Job parameters, add the following parameter:
  - Key: --enable-lakeformation-fine-grained-access
  - Value: true
- Choose Save.
To enable Lake Formation FGAC for your AWS Glue notebooks on the AWS Glue console, use the %%configure magic:
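The cell below is a minimal sketch of such a %%configure cell, assuming you only need to pass the FGAC job parameter; you can add any other session parameters your notebook requires to the same JSON dictionary.

```
%%configure
{
  "--enable-lakeformation-fine-grained-access": "true"
}
```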
Example use case
The following diagram represents the high-level architecture of the use case we demonstrate in this post. The objective of the use case is to showcase how you can enforce Lake Formation FGAC on both CSV and Iceberg tables and configure an AWS Glue PySpark job to read from them.
The implementation consists of the following steps:
- Create an S3 bucket and upload the input CSV dataset.
- Create a standard Data Catalog table and an Iceberg table by reading data from the input CSV table, using an Athena CTAS query.
- Use Lake Formation to enable FGAC on both CSV and Iceberg tables using row- and column-based filters.
- Run two sample AWS Glue jobs to showcase how you can run a sample PySpark script in AWS Glue that respects the Lake Formation FGAC permissions, and then write the output to Amazon S3.
To demonstrate the implementation steps, we use sample product inventory data that has the following attributes:
- op – The operation on the source record. This shows values I to represent insert operations, U to represent updates, and D to represent deletes.
- product_id – The primary key column in the source database’s products table.
- category – The product’s category, such as Electronics or Cosmetics.
- product_name – The name of the product.
- quantity_available – The quantity available in the inventory for a product.
- last_update_time – The time when the product record was updated at the source database.
To implement this workflow, we create AWS resources such as an S3 bucket, define FGAC with Lake Formation, and build AWS Glue jobs to query those tables.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account with AWS Identity and Access Management (IAM) roles as needed.
- The required permissions to perform the following actions:
- Read or write to an S3 bucket.
- Create and run AWS Glue crawlers and jobs.
- Manage Data Catalog databases and tables.
- Manage Athena workgroups and run queries.
- Lake Formation already set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.
For this post, we use the eu-west-1 AWS Region, but you can use your preferred Region if the AWS services included in the architecture are available in that Region.
Next, let’s dive into the implementation steps.
Create an S3 bucket
To create an S3 bucket for the raw input datasets and Iceberg table, complete the following steps:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Choose Create bucket.
- Enter the bucket name (for example, glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), and leave the remaining fields as default.
- Choose Create bucket.
- On the bucket details page, choose Create folder.
- Create two subfolders: raw-csv-input and iceberg-datalake.
- Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.
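If you prefer to script this setup instead of using the console, the following is a minimal boto3 sketch. The bucket name, Region, and local file path are placeholders; replace them with your own values.

```python
import boto3

# Placeholder values: use your own account ID, Region, and local path to the CSV file
bucket_name = "glue5-lf-demo-123456789012-eu-west-1"
region = "eu-west-1"

s3 = boto3.client("s3", region_name=region)

# Create the bucket (a LocationConstraint is required outside us-east-1)
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Uploading with a key prefix implicitly creates the raw-csv-input folder;
# the iceberg-datalake prefix is populated later when the Iceberg table is written
s3.upload_file("LOAD00000001.csv", bucket_name, "raw-csv-input/LOAD00000001.csv")
```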
Create tables
To create input and output tables in the Data Catalog, complete the following steps:
- On the Athena console, navigate to the query editor.
- Run the queries that create the raw_csv_input and iceberg_datalake tables in sequence (provide your S3 bucket name); a sketch of these queries is shown after this procedure.
- Run a query to validate the raw CSV input data.
The following screenshot shows the query result.
- Run a query to validate the Iceberg table data.
The following screenshot shows the query result.
This step used DDL to create table definitions. Alternatively, you can use a Data Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.
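The exact queries are not reproduced here; the following is a minimal sketch of what they could look like, assuming the column types shown, a database named glue5_lf_demo, and the folder layout created earlier. Replace the bucket name placeholder with your own, and adjust the SerDe settings (for example, a skip.header.line.count table property) to match your CSV file.

```sql
-- Database for the demo tables (assumption: created here if it does not exist yet)
CREATE DATABASE IF NOT EXISTS glue5_lf_demo;

-- Standard table over the raw CSV input (column types are assumptions)
CREATE EXTERNAL TABLE glue5_lf_demo.raw_csv_input (
  op string,
  product_id bigint,
  category string,
  product_name string,
  quantity_available int,
  last_update_time string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://glue5-lf-demo-<account-id>-<region>/raw-csv-input/';

-- Iceberg table created from the CSV table with an Athena CTAS query
CREATE TABLE glue5_lf_demo.iceberg_datalake
WITH (
  table_type = 'ICEBERG',
  format = 'PARQUET',
  location = 's3://glue5-lf-demo-<account-id>-<region>/iceberg-datalake/',
  is_external = false
)
AS SELECT * FROM glue5_lf_demo.raw_csv_input;

-- Validation queries
SELECT * FROM glue5_lf_demo.raw_csv_input;
SELECT * FROM glue5_lf_demo.iceberg_datalake;
```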
Next, let’s configure Lake Formation permissions on the raw_csv_input and iceberg_datalake tables.
Configure Lake Formation permissions
To validate the capability, let’s define FGAC permissions for the two Data Catalog tables we created.
For the raw_csv_input table, we enable permission for specific rows, for example allowing read access only for the Furniture category. Similarly, for the iceberg_datalake table, we enable a data filter for the Electronics product category and limit read access to a few columns only.
To configure Lake Formation permissions for the two tables, complete the following steps:
- On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
- Choose Register location.
- For Amazon S3 path, enter the path of your S3 bucket to register the location.
- For IAM role, choose your Lake Formation data access IAM role, which is not a service-linked role.
- For Permission mode, select Lake Formation.
- Choose Register location.
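The location can also be registered programmatically. The following boto3 sketch uses placeholder ARNs for the bucket and the Lake Formation data access role; replace them with your own.

```python
import boto3

lf = boto3.client("lakeformation", region_name="eu-west-1")

# Placeholder ARNs: use your bucket and your Lake Formation data access role
lf.register_resource(
    ResourceArn="arn:aws:s3:::glue5-lf-demo-123456789012-eu-west-1",
    RoleArn="arn:aws:iam::123456789012:role/LakeFormationDataAccessRole",
)
```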
Grant table permissions on the standard table
The next step is to grant table permissions on the raw_csv_input table to the AWS Glue job role.
- On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- For Principals, choose IAM users and roles.
- For IAM users and roles, choose the IAM role that will be used by the AWS Glue job.
- For LF-Tags or catalog resources, choose Named Data Catalog resources.
- For Databases, choose glue5_lf_demo.
- For Tables, choose raw_csv_input.
- For Data filters, choose Create new.
- In the Create data filter dialog, provide the following information:
  - For Data filter name, enter product_furniture.
  - For Column-level access, select Access to all columns.
  - Select Filter rows.
  - For Row filter expression, enter category='Furniture'.
  - Choose Create filter.
- For Data filters, select the product_furniture filter you created.
- For Data filter permissions, choose Select and Describe.
- Choose Grant.
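For reference, a rough boto3 sketch of the equivalent API calls is shown below; the grant for the Iceberg table in the next section is analogous, using ColumnNames instead of a column wildcard. The account ID and role ARN are placeholders, and the parameter shapes may need adjusting for your environment.

```python
import boto3

lf = boto3.client("lakeformation", region_name="eu-west-1")

account_id = "123456789012"  # placeholder
glue_job_role_arn = "arn:aws:iam::123456789012:role/GlueJobRole"  # placeholder

# Create the row-level data cell filter on the standard table
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "glue5_lf_demo",
        "TableName": "raw_csv_input",
        "Name": "product_furniture",
        "RowFilter": {"FilterExpression": "category='Furniture'"},
        "ColumnWildcard": {},  # access to all columns
    }
)

# Grant SELECT and DESCRIBE on the filter to the AWS Glue job role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": glue_job_role_arn},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": account_id,
            "DatabaseName": "glue5_lf_demo",
            "TableName": "raw_csv_input",
            "Name": "product_furniture",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```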
Grant permissions on the Iceberg table
The next step is to grant table permissions on the iceberg_datalake table to the AWS Glue job role.
- On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- For Principals, choose IAM users and roles.
- For IAM users and roles, choose the IAM role that will be used by the AWS Glue job.
- For LF-Tags or catalog resources, choose Named Data Catalog resources.
- For Databases, choose glue5_lf_demo.
- For Tables, choose iceberg_datalake.
- For Data filters, choose Create new.
- In the Create data filter dialog, provide the following information:
  - For Data filter name, enter product_electronics.
  - For Column-level access, select Include columns.
  - For Included columns, choose category, last_update_time, op, product_name, and quantity_available.
  - Choose Filter rows.
  - For Row filter expression, enter category='Electronics'.
  - Choose Create filter.
- For Data filters, select the product_electronics filter you created.
- For Data filter permissions, choose Select and Describe.
- Choose Grant.
Next, let’s create the AWS Glue PySpark job to process the input data.
Query the standard table through an AWS Glue 5.0 job
Complete the following steps to create an AWS Glue job to load data from the raw_csv_input table:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- For Create job, choose Script Editor.
- For Engine, choose Spark.
- For Options, choose Start fresh.
- Choose Create script.
- For Script, use a PySpark script that reads the raw_csv_input table and writes the result to your S3 output path (a sketch is shown after this procedure). This example script writes the output in Parquet format; you can change this according to your use case.
- On the Job details tab, for Name, enter glue5-lf-demo.
- For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
- For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
- For Job parameters, add the following parameter:
  - Key: --enable-lakeformation-fine-grained-access
  - Value: true
- Choose Save and then Run.
- When the job is complete, on the Run details tab at the bottom of the job run page, choose Output logs.
You’re redirected to the Amazon CloudWatch console to validate the output.
The printed table is shown in the following screenshot. Only two records were returned because they are Furniture category products.
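The job script itself is not reproduced above; the following is a minimal sketch of the kind of PySpark code referenced in this procedure, assuming the Data Catalog is used as the Spark metastore (the default for jobs created on the AWS Glue console) and using a placeholder output path.

```python
from pyspark.sql import SparkSession

# Placeholder output path: replace with your S3 output path
s3_output_path = "s3://glue5-lf-demo-123456789012-eu-west-1/output/raw-csv-input/"

spark = SparkSession.builder.getOrCreate()

# Read the Lake Formation protected table; only the rows allowed by the
# product_furniture data cell filter (category='Furniture') are returned
df = spark.sql("SELECT * FROM glue5_lf_demo.raw_csv_input")
df.show()

# Write the filtered result to Amazon S3 in Parquet format
df.write.mode("overwrite").parquet(s3_output_path)
```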
Query the Iceberg table through an AWS Glue 5.0 job
Next, complete the following steps to create an AWS Glue job to load data from the iceberg_datalake table:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- For Create job, choose Script Editor.
- For Engine, choose Spark.
- For Options, choose Start fresh.
- Choose Create script.
- For Script, use a PySpark script that reads the iceberg_datalake table and writes the result to Amazon S3 (a sketch is shown after this procedure), and replace the following parameters:
  - Replace aws_region with your Region.
  - Replace aws_account_id with your AWS account ID.
  - Replace warehouse_path with your S3 warehouse path for the Iceberg table.
  - Replace <s3_output_path> with your S3 output path.
This example script writes the output in Parquet format; you can change it according to your use case.
- On the Job details tab, for Name, enter glue5-lf-demo-iceberg.
- For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
- For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
- For Job parameters, add the following parameters:
  - Key: --enable-lakeformation-fine-grained-access
  - Value: true
  - Key: --datalake-formats
  - Value: iceberg
- Choose Save and then Run.
- When the job is complete, on the Run details tab, choose Output logs.
You’re redirected to the CloudWatch console to validate the output.
The printed table is shown in the following screenshot. Only two records were returned because they are Electronics category products, and the product_id column is excluded.
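As with the previous job, the script is not reproduced above. The following sketch shows one possible configuration of an Iceberg-enabled Spark session against the Data Catalog, using placeholder values for the parameters listed in the procedure; your catalog settings may differ.

```python
from pyspark.sql import SparkSession

# Placeholder values: replace with your Region, account ID, warehouse path, and output path
aws_region = "eu-west-1"
aws_account_id = "123456789012"
warehouse_path = "s3://glue5-lf-demo-123456789012-eu-west-1/iceberg-datalake/"
s3_output_path = "s3://glue5-lf-demo-123456789012-eu-west-1/output/iceberg/"

spark = (
    SparkSession.builder
    # Register an Iceberg catalog named glue_catalog backed by the AWS Glue Data Catalog
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", warehouse_path)
    .config("spark.sql.catalog.glue_catalog.glue.id", aws_account_id)
    .config("spark.sql.catalog.glue_catalog.client.region", aws_region)
    .getOrCreate()
)

# Read the Iceberg table; the product_electronics data cell filter limits the rows
# to category='Electronics' and hides the product_id column
df = spark.sql("SELECT * FROM glue_catalog.glue5_lf_demo.iceberg_datalake")
df.show()

# Write the filtered result to Amazon S3 in Parquet format
df.write.mode("overwrite").parquet(s3_output_path)
```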
You have now verified that records of the raw_csv_input and iceberg_datalake tables are retrieved with the configured Lake Formation data cell filters applied.
Clean up
Complete the following steps to clean up your resources:
- Delete the AWS Glue jobs glue5-lf-demo and glue5-lf-demo-iceberg.
- Delete the Lake Formation permissions.
- Delete the output files written to the S3 bucket.
- Delete the bucket you created for the input datasets, which might have a name similar to glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.
Conclusion
This post explained how you can enable Lake Formation FGAC in AWS Glue jobs and notebooks to enforce access controls defined through Lake Formation grant commands. Previously, you needed to use AWS Glue DynamicFrames to enforce FGAC in AWS Glue jobs, but with this release, you can enforce FGAC through Spark DataFrames or Spark SQL. This capability works not only with standard file formats like CSV, JSON, and Parquet, but also with Apache Iceberg.
This feature can save you effort and improve portability when migrating Spark scripts between serverless environments such as AWS Glue and Amazon EMR.
About the Authors
Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to-end data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.
Layth Yassin is a Software Development Engineer on the AWS Glue team. He’s passionate about tackling challenging problems at a large scale, and building products that push the limits of the field. Outside of work, he enjoys playing/watching basketball, and spending time with friends and family.