AWS Big Data Blog

Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation

AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or those that deal with sensitive data.

Lake Formation makes it straightforward to build, secure, and manage data lakes. It allows you to define fine-grained access controls through grant and revoke statements, similar to those used with relational database management systems (RDBMS), and automatically enforce those policies using compatible engines like Amazon Athena, Apache Spark on Amazon EMR, and Amazon Redshift Spectrum. With AWS Glue 5.0, the same Lake Formation rules that you set up for use with other services like Athena now apply to your AWS Glue Spark jobs and Interactive Sessions through built-in Spark SQL and Spark DataFrames. This simplifies security and governance of your data lakes.

This post demonstrates how to enforce FGAC on AWS Glue 5.0 through Lake Formation permissions.

How FGAC works on AWS Glue 5.0

Using AWS Glue 5.0 with Lake Formation adds a layer of permissions on each Spark job so that Lake Formation access controls are enforced when AWS Glue runs the job. AWS Glue uses Spark resource profiles to run each job with two profiles: the user profile runs user-supplied code, and the system profile enforces Lake Formation policies. For more information, see the AWS Lake Formation Developer Guide.

The following diagram demonstrates a high-level overview of how AWS Glue 5.0 gets access to data protected by Lake Formation permissions.

The workflow consists of the following steps:

  1. A user calls the StartJobRun API on a Lake Formation enabled AWS Glue job.
  2. AWS Glue sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that has no ability to launch tasks, request executors, or access Amazon Simple Storage Service (Amazon S3) or the AWS Glue Data Catalog. It builds a job plan.
  3. AWS Glue sets up a second driver called the system driver and runs it in the system profile (with a privileged identity). AWS Glue sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plans to the system driver. The system driver doesn’t run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.
  4. AWS Glue then runs the stages on executors with the user driver or system driver. The user code in any stage is run exclusively on user profile executors.
  5. Stages that read data from Data Catalog tables protected by Lake Formation or those that apply security filters are delegated to system executors.
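
As a reference for step 1, the following is a minimal boto3 sketch that starts a Lake Formation enabled AWS Glue job. The job name and Region are placeholders for the job you configure later in this post.

import boto3

# Minimal sketch: start a Lake Formation enabled AWS Glue job (step 1 above).
# The job name and Region are placeholders; use your own FGAC-enabled job.
glue = boto3.client("glue", region_name="eu-west-1")

response = glue.start_job_run(JobName="glue5-lf-demo")
print(f"Started job run {response['JobRunId']}")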

Enable FGAC on AWS Glue 5.0

To enable Lake Formation FGAC for your AWS Glue 5.0 jobs on the AWS Glue console, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose your job.
  3. Choose the Job details tab.
  4. For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
  5. For Job parameters, add the following parameter:
    1. Key: --enable-lakeformation-fine-grained-access
    2. Value: true
  6. Choose Save.
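
If you create jobs programmatically, the following boto3 sketch shows the same setting applied through DefaultArguments. The job name, role ARN, and script location are placeholders for your own values.

import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Hypothetical job definition; replace the name, role ARN, and script location with your own.
glue.create_job(
    Name="glue5-lf-demo",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<bucket-name>/scripts/glue5-lf-demo.py",
        "PythonVersion": "3",
    },
    GlueVersion="5.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    # Enables Lake Formation fine-grained access control for every run of this job
    DefaultArguments={"--enable-lakeformation-fine-grained-access": "true"},
)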

To enable Lake Formation FGAC for your AWS Glue notebooks on the AWS Glue console, use %%configure magic:

%glue_version 5.0
%%configure
{
    "--enable-lakeformation-fine-grained-access": "true"
}

Example use case

The following diagram represents the high-level architecture of the use case we demonstrate in this post. The objective of the use case is to showcase how you can enforce Lake Formation FGAC on both CSV and Iceberg tables and configure an AWS Glue PySpark job to read from them.

The implementation consists of the following steps:

  1. Create an S3 bucket and upload the input CSV dataset.
  2. Create a standard Data Catalog table on the input CSV data, then create an Iceberg table from it using an Athena CTAS query.
  3. Use Lake Formation to enable FGAC on both CSV and Iceberg tables using row- and column-based filters.
  4. Run two sample AWS Glue jobs to showcase how you can run a sample PySpark script in AWS Glue that respects the Lake Formation FGAC permissions, and then write the output to Amazon S3.

To demonstrate the implementation steps, we use sample product inventory data that has the following attributes:

  • op – The operation on the source record: I represents an insert, U an update, and D a delete.
  • product_id – The primary key column in the source database’s products table.
  • category – The product’s category, such as Electronics or Cosmetics.
  • product_name – The name of the product.
  • quantity_available – The quantity available in the inventory for a product.
  • last_update_time – The time when the product record was updated at the source database.
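
If you don't have the LOAD00000001.csv file at hand, the following snippet writes a few hypothetical rows that match this schema so you can follow along; the values are illustrative only.

# Hypothetical rows matching the product inventory schema above (illustrative values only).
# Column order: op, product_id, category, product_name, quantity_available, last_update_time
sample_rows = [
    "I,101,Furniture,Office Chair,25,2025-01-15 10:10:00",
    "I,102,Furniture,Standing Desk,12,2025-01-15 10:12:00",
    "I,103,Electronics,Wireless Mouse,45,2025-01-15 10:15:00",
    "U,104,Electronics,USB-C Cable,80,2025-01-15 10:18:00",
    "D,105,Cosmetics,Face Cream,0,2025-01-15 10:20:00",
]

with open("LOAD00000001.csv", "w") as f:
    f.write("\n".join(sample_rows) + "\n")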

To implement this workflow, we create AWS resources such as an S3 bucket, define FGAC with Lake Formation, and build AWS Glue jobs to query those tables.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account with AWS Identity and Access Management (IAM) roles as needed.
  • The required permissions to perform the following actions:
    • Read or write to an S3 bucket.
    • Create and run AWS Glue crawlers and jobs.
    • Manage Data Catalog databases and tables.
    • Manage Athena workgroups and run queries.
  • Lake Formation already set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.

For this post, we use the eu-west-1 AWS Region, but you can use your preferred Region if the AWS services included in the architecture are available there.

Next, let’s dive into the implementation steps.

Create an S3 bucket

To create an S3 bucket for the raw input datasets and Iceberg table, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter the bucket name (for example, glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), and leave the remaining fields as default.
  4. Choose Create bucket.
  5. On the bucket details page, choose Create folder.
  6. Create two subfolders: raw-csv-input and iceberg-datalake.
  7. Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.
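
If you prefer to script these steps, the following boto3 sketch creates the bucket and uploads the file. The bucket name and Region are placeholders for your own values.

import boto3

region = "eu-west-1"
bucket_name = "glue5-lf-demo-123456789012-euw1"  # placeholder; use your own bucket name

s3 = boto3.client("s3", region_name=region)
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# S3 prefixes act as folders; uploading under raw-csv-input/ creates the "folder" implicitly.
s3.upload_file("LOAD00000001.csv", bucket_name, "raw-csv-input/LOAD00000001.csv")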

Create tables

To create input and output tables in the Data Catalog, complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Run the following queries in sequence (provide your S3 bucket name):
    -- Create database for the demo
    CREATE DATABASE glue5_lf_demo;
    
    -- Create external table on the input CSV files. Replace the S3 path with your bucket name
    CREATE EXTERNAL TABLE glue5_lf_demo.raw_csv_input(
     op string, 
     product_id bigint, 
     category string, 
     product_name string, 
     quantity_available bigint, 
     last_update_time string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://<bucket-name>/raw-csv-input/'
    TBLPROPERTIES (
      'areColumnsQuoted'='false', 
      'classification'='csv', 
      'columnsOrdered'='true', 
      'compressionType'='none', 
      'delimiter'=',', 
      'typeOfData'='file');
     
    -- Create output Iceberg table with partitioning. Replace the S3 bucket name with your bucket name
    CREATE TABLE glue5_lf_demo.iceberg_datalake WITH (
      table_type='ICEBERG',
      format='parquet',
      write_compression = 'SNAPPY',
      is_external = false,
      partitioning=ARRAY['category', 'bucket(product_id, 16)'],
      location='s3://<bucket-name>/iceberg-datalake/'
    ) AS SELECT * FROM glue5_lf_demo.raw_csv_input;
  3. Run the following query to validate the raw CSV input data:
    SELECT * FROM glue5_lf_demo.raw_csv_input;

The following screenshot shows the query result.

  4. Run the following query to validate the Iceberg table data:
    SELECT * FROM glue5_lf_demo.iceberg_datalake;

The following screenshot shows the query result.

This step used DDL to create table definitions. Alternatively, you can use a Data Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.
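
For example, the following is a minimal boto3 sketch of the Data Catalog API route for the raw CSV table; the bucket name is a placeholder, and the Iceberg table is simpler to create with the Athena CTAS query shown earlier.

import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Sketch of defining raw_csv_input through the Data Catalog API instead of Athena DDL.
glue.create_table(
    DatabaseName="glue5_lf_demo",
    TableInput={
        "Name": "raw_csv_input",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "delimiter": ","},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "op", "Type": "string"},
                {"Name": "product_id", "Type": "bigint"},
                {"Name": "category", "Type": "string"},
                {"Name": "product_name", "Type": "string"},
                {"Name": "quantity_available", "Type": "bigint"},
                {"Name": "last_update_time", "Type": "string"},
            ],
            "Location": "s3://<bucket-name>/raw-csv-input/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)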

Next, let’s configure Lake Formation permissions on the raw_csv_input table and iceberg_datalake table.

Configure Lake Formation permissions

To validate the capability, let’s define FGAC permissions for the two Data Catalog tables we created.

For the raw_csv_input table, we enable permissions on specific rows, for example, allowing read access only to the Furniture category. Similarly, for the iceberg_datalake table, we enable a data filter for the Electronics product category and limit read access to a few columns only.

To configure Lake Formation permissions for the two tables, complete the following steps:

  1. On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path of your S3 bucket to register the location.
  4. For IAM role, choose your Lake Formation data access IAM role, which is not a service-linked role.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.
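
The registration can also be scripted; the following boto3 sketch assumes a hypothetical data access role ARN and the bucket created earlier.

import boto3

lf = boto3.client("lakeformation", region_name="eu-west-1")

# Register the S3 location with Lake Formation using a dedicated data access role.
# The ARNs are placeholders; HybridAccessEnabled=False corresponds to the Lake Formation permission mode.
lf.register_resource(
    ResourceArn="arn:aws:s3:::glue5-lf-demo-123456789012-euw1",
    UseServiceLinkedRole=False,
    RoleArn="arn:aws:iam::123456789012:role/LakeFormationDataAccessRole",
    HybridAccessEnabled=False,
)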

Grant table permissions on the standard table

The next step is to grant table permissions on the raw_csv_input table to the AWS Glue job role.

  1. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles.
  4. For IAM users and roles, choose the IAM role that will be used by the AWS Glue job.
  5. For LF-Tags or catalog resources, choose Named Data Catalog resources.
  6. For Databases, choose glue5_lf_demo.
  7. For Tables, choose raw_csv_input.
  8. For Data filters, choose Create new.
  9. In the Create data filter dialog, provide the following information:
    1. For Data filter name, enter product_furniture.
    2. For Column-level access, select Access to all columns.
    3. Select Filter rows.
    4. For Row filter expression, enter category='Furniture'.
    5. Choose Create filter.
  10. For Data filters, select the filter product_furniture you created.
  11. For Data filter permissions, choose Select and Describe.
  12. Choose Grant.
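
The same grant can be expressed with the Lake Formation API; the following boto3 sketch creates the product_furniture data cells filter (row filter, all columns) and grants it to a hypothetical AWS Glue job role.

import boto3

account_id = "123456789012"  # placeholder AWS account ID
job_role_arn = f"arn:aws:iam::{account_id}:role/GlueJobRole"  # placeholder job role

lf = boto3.client("lakeformation", region_name="eu-west-1")

# Row filter on raw_csv_input: Furniture rows only, all columns visible
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "glue5_lf_demo",
        "TableName": "raw_csv_input",
        "Name": "product_furniture",
        "RowFilter": {"FilterExpression": "category='Furniture'"},
        "ColumnWildcard": {},
    }
)

# Grant SELECT and DESCRIBE on the filter to the AWS Glue job role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": job_role_arn},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": account_id,
            "DatabaseName": "glue5_lf_demo",
            "TableName": "raw_csv_input",
            "Name": "product_furniture",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)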

Grant permissions on the Iceberg table

The next step is to grant table permissions on the iceberg_datalake table to the AWS Glue job role.

  1. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles.
  4. For IAM users and roles, choose the IAM role that will be used by the AWS Glue job.
  5. For LF-Tags or catalog resources, choose Named Data Catalog resources.
  6. For Databases, choose glue5_lf_demo.
  7. For Tables, choose iceberg_datalake.
  8. For Data filters, choose Create new.
  9. In the Create data filter dialog, provide the following information:
    1. For Data filter name, enter product_electronics.
    2. For Column-level access, select Include columns.
    3. For Included columns, choose category, last_update_time, op, product_name, and quantity_available.
    4. Select Filter rows.
    5. For Row filter expression, enter category='Electronics'.
    6. Choose Create filter.
  10. For Data filters, select the filter product_electronics you created.
  11. For Data filter permissions, choose Select and Describe.
  12. Choose Grant.
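
The Iceberg table filter follows the same pattern, this time listing the allowed columns explicitly. The grant_permissions call is identical to the earlier sketch except that it references the iceberg_datalake table and the product_electronics filter.

import boto3

account_id = "123456789012"  # placeholder AWS account ID
lf = boto3.client("lakeformation", region_name="eu-west-1")

# Row and column filter on iceberg_datalake: Electronics rows, product_id excluded
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "glue5_lf_demo",
        "TableName": "iceberg_datalake",
        "Name": "product_electronics",
        "RowFilter": {"FilterExpression": "category='Electronics'"},
        "ColumnNames": [
            "category",
            "last_update_time",
            "op",
            "product_name",
            "quantity_available",
        ],
    }
)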

Next, let’s create the AWS Glue PySpark job to process the input data.

Query the standard table through an AWS Glue 5.0 job

Complete the following steps to create an AWS Glue job to load data from the raw_csv_input table:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. For Create job, choose Script Editor.
  3. For Engine, choose Spark.
  4. For Options, choose Start fresh.
  5. Choose Create script.
  6. For Script, use the following code, providing your S3 output path. This example script writes the output in Parquet format; you can change this according to your use case.
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Read from raw CSV table
    df = spark.sql("SELECT * FROM glue5_lf_demo.raw_csv_input")
    df.show()
    
    # Write to your preferred location.
    df.write.mode("overwrite").parquet("s3://<s3_output_path>")
  7. On the Job details tab, for Name, enter glue5-lf-demo.
  8. For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
  9. For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
  10. For Job parameters, add the following parameter:
    1. Key: --enable-lakeformation-fine-grained-access
    2. Value: true
  11. Choose Save and then Run.
  12. When the job is complete, on the Run details tab at the bottom of job runs, choose Output logs.

You’re redirected to the Amazon CloudWatch console to validate the output.

The printed table is shown in the following screenshot. Only two records were returned because they are Furniture category products.

Query the Iceberg table through an AWS Glue 5.0 job

Next, complete the following steps to create an AWS Glue job to load data from the iceberg_datalake table:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. For Create job, choose Script Editor.
  3. For Engine, choose Spark.
  4. For Options, choose Start fresh.
  5. Choose Create script.
  6. For Script, use the following code, replacing these parameters:
    1. Replace aws_region with your Region.
    2. Replace aws_account_id with your AWS account ID.
    3. Replace warehouse_path with your S3 warehouse path for the Iceberg table.
    4. Replace <s3_output_path> with your S3 output path.

This example script writes the output in Parquet format; you can change it according to your use case.

from pyspark.sql import SparkSession

catalog_name = "spark_catalog"
aws_region = "eu-west-1"
aws_account_id = "123456789012"
warehouse_path = "s3://<bucket-name>/warehouse"

# Create Spark Session with Iceberg Configurations
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.account-id", f"{aws_account_id}") \
    .getOrCreate()

# Read from Iceberg table
df = spark.sql(f"SELECT * FROM {catalog_name}.glue5_lf_demo.iceberg_datalake")
df.show()

# Write to your preferred location.
df.write.mode("overwrite").parquet("s3://<s3_output_path>")
  7. On the Job details tab, for Name, enter glue5-lf-demo-iceberg.
  8. For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
  9. For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
  10. For Job parameters, add the following parameters:
    1. Key: --enable-lakeformation-fine-grained-access
    2. Value: true
    3. Key: --datalake-formats
    4. Value: iceberg
  11. Choose Save and then Run.
  12. When the job is complete, on the Run details tab, choose Output logs.

You’re redirected to the CloudWatch console to validate the output.

The printed table is shown in the following screenshot. Only two records were returned because they are Electronics category products, and the product_id column is excluded.

You have now verified that records from the raw_csv_input and iceberg_datalake tables are retrieved with the configured Lake Formation data cell filters applied.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the AWS Glue jobs glue5-lf-demo and glue5-lf-demo-iceberg.
  2. Delete the Lake Formation permissions.
  3. Delete the output files written to the S3 bucket.
  4. Delete the bucket you created for the input datasets, which might have a name similar to glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.
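
The following boto3 sketch covers the same cleanup programmatically; the bucket name and account ID are placeholders, and it assumes the two data cells filters created earlier.

import boto3

region = "eu-west-1"
account_id = "123456789012"  # placeholder
bucket_name = "glue5-lf-demo-123456789012-euw1"  # placeholder

# Delete the two AWS Glue jobs
glue = boto3.client("glue", region_name=region)
for job_name in ["glue5-lf-demo", "glue5-lf-demo-iceberg"]:
    glue.delete_job(JobName=job_name)

# Delete the Lake Formation data cells filters
lf = boto3.client("lakeformation", region_name=region)
for table_name, filter_name in [
    ("raw_csv_input", "product_furniture"),
    ("iceberg_datalake", "product_electronics"),
]:
    lf.delete_data_cells_filter(
        TableCatalogId=account_id,
        DatabaseName="glue5_lf_demo",
        TableName=table_name,
        Name=filter_name,
    )

# Empty and delete the S3 bucket
s3 = boto3.resource("s3", region_name=region)
bucket = s3.Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()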

Conclusion

This post explained how you can enable Lake Formation FGAC in AWS Glue jobs and notebooks to enforce access control defined using Lake Formation grant commands. Previously, you needed to use AWS Glue DynamicFrames to enforce FGAC in AWS Glue jobs; with this release, you can enforce FGAC through Spark DataFrames or Spark SQL. This capability works not only with standard file formats like CSV, JSON, and Parquet, but also with Apache Iceberg.

This feature can save you effort and improve portability when migrating Spark scripts between serverless environments such as AWS Glue and Amazon EMR.


About the Authors

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to-end data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.

Layth Yassin is a Software Development Engineer on the AWS Glue team. He’s passionate about tackling challenging problems at a large scale, and building products that push the limits of the field. Outside of work, he enjoys playing/watching basketball, and spending time with friends and family.