Introducing AWS Glue 5.0 for Apache Spark

AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. Today, we are launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster.

This post describes what’s new in AWS Glue 5.0, performance improvements, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.0.

What’s new in AWS Glue 5.0

AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, and Java 17 with new performance and security improvements from the open source. AWS Glue 5.0 also updates support for open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes. AWS Glue 5.0 adds support for Spark-native fine-grained access control with AWS Lake Formation so you can apply table- and column-level permissions on an Amazon Simple Storage Service (Amazon S3) data lake for write operations (such as INSERT INTO and INSERT OVERWRITE) with Spark jobs.

Key features include:

Amazon SageMaker Unified Studio support
Amazon SageMaker Lakehouse support
Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
Spark-native fine-grained access control using Lake Formation
Amazon S3 Access Grants support
requirements.txt support to install additional Python libraries
Data lineage support in Amazon DataZone

Amazon SageMaker Unified Studio support

Amazon SageMaker Unified Studio supports AWS Glue 5.0 for compute runtime of unified notebooks and visual ETL flow editor.

Amazon SageMaker Lakehouse support

Glue 5.0 supports native integration with Amazon SageMaker Lakehouse to enable unified access across Amazon Redshift data warehouses and S3 data lakes.

Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17

AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17. Glue 5.0 uses AWS performance optimized Spark runtime, 3.9 times faster than open source Spark. Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%.

For more details about updated library dependencies, see Dependent library upgrades section.

Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1

AWS Glue 5.0 upgrades the open table format libraries to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1. To learn more, visit Use open table format libraries on AWS Glue 5.0 for Apache Spark.

Spark-native fine-grained access control using Lake Formation

AWS Glue supports AWS Lake Formation Fine Grained Access Control (FGAC) through native Spark DataFrames and Spark SQL. To learn more, visit Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation.

S3 Access Grants support

S3 Access Grants provides a simplified model for defining access permissions to data in Amazon S3 by prefix, bucket, or object. AWS Glue 5.0 supports S3 Access Grants through EMR File System (EMRFS) using additional Spark configurations:

Key: --conf
Value: hadoop.fs.s3.s3AccessGrants.enabled=true --conf spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM=false

To learn more, refer to Managing access with S3 Access Grants.

requirements.txt support to install additional Python libraries

In AWS Glue 5.0, you can provide the standard requirements.txt file to manage Python library dependencies. To do that, provide the following job parameters:

Parameter 1:
- Key: --python-modules-installer-option
- Value: -r
Parameter 2:
- Key: --additional-python-modules
- Value: s3://path_to_requirements.txt

AWS Glue 5.0 nodes initially load Python libraries specified in requirements.txt. The following code illustrates the sample requirements.txt:

awswrangler==3.9.1 
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0 
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36

Data lineage support in Amazon DataZone

AWS Glue 5.0 supports data lineage in Amazon DataZone. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone.

To configure this on the AWS Glue console, enable Generate lineage events, and enter your Amazon DataZone domain ID on the Job details tab.

Alternatively, you can provide the following job parameter (provide your DataZone domain ID):

Key: --conf
Value: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<Your-Domain-ID>

Learn more in Amazon DataZone introduces OpenLineage-compatible data lineage visualization

Improved performance

AWS Glue 5.0 improves the price-performance of your AWS Glue jobs. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%. The following chart shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between AWS Glue 4.0 and AWS Glue 5.0. The TPC-DS dataset is located in an S3 bucket in Parquet format, and we used 30 G.2X workers in AWS Glue. We observed that our AWS Glue 5.0 TPC-DS tests on Amazon S3 was 58% faster than that on AWS Glue 4.0 while reducing cost by 36%.

.	AWS Glue 4.0	AWS Glue 5.0
Total Query Time (seconds)	1896.1904	1197.78755
Geometric Mean (seconds)	10.09472	6.82208
Estimated Cost ($)	45.85533	29.20133

The following graphs illustrates the comparisons of performance and cost.

Dependent library upgrades

The following table lists dependency upgrades.

Dependency	Version in AWS Glue 4.0	Version in AWS Glue 5.0
Spark	3.3.0	3.5.2
Hadoop	3.3.3	3.4.0
Scala	2.12	2.12.18
Hive	2.3.9	2.3.9
EMRFS	2.54.0	2.66.0
Arrow	7.0.0	12.0.1
Iceberg	1.0.0	1.6.1
Hudi	0.12.1	0.15.0
Delta Lake	2.1.0	3.2.1
Java	8	17
Python	3.10	3.11
boto3	1.26	1.34.131
AWS SDK for Java	1.12	2.28.8
AWS Glue Data Catalog Client	3.7.0	4.2.0
EMR DynamoDB Connector	4.16.0	5.6.0

The following table lists database connector (JDBC driver) upgrades.

Driver	Connector Version in AWS Glue 4.0	Connector Version in AWS Glue 5.0
MySQL	8.0.23	8.0.33
Microsoft SQL Server	9.4.0	10.2.0
Oracle Databases	21.7	23.3.0.23.09
PostgreSQL	42.3.6	42.7.3
Amazon Redshift	redshift-jdbc42-2.1.0.16	redshift-jdbc42-2.1.0.29

The following are Spark connector upgrades:

Driver	Connector Version in AWS Glue 4.0	Connector Version in AWS Glue 5.0
Amazon Redshift	6.1.3	6.3.0
OpenSearch	1.0.1	1.2.0
MongoDB	10.0.4	10.3.0
Snowflake	2.12.0	3.0.0
BigQuery	0.32.2	0.32.2

Apache Spark highlights

Spark 3.5.2 in AWS Glue 5.0 brings a number of valuable features, which we highlight in this section. To learn more about the highlights and enhancements of Spark 3.4 and 3.5, refer to Spark Release 3.4.0 and Spark Release 3.5.0.

Apache Arrow-optimized Python UDF

Python user-defined functions (UDFs) enable users to build custom code for data processing needs, providing flexibility and accessibility. However, performance suffers because UDFs require serialization between Python and JVM processes. Spark 3.5’s Apache Arrow-optimized UDFs solve this by keeping data in shared memory using Arrow’s high-performance columnar format, eliminating serialization overhead and making UDFs efficient for large-scale processing.

To use Arrow-optimized Python UDFs, set spark.sql.execution.pythonUDF.arrow.enabled to True.

Python user-defined table functions

A user-defined table function (UDTF) is a function that returns an entire output table instead of a single value. PySpark users can now write custom UDTFs with Python logic and use them in PySpark and SQL queries. Called in the FROM clause, UDTFs can accept zero or more arguments, either as scalar expressions or table arguments. The UDTF’s return type, defined as either a StructType (for example, StructType().add("c1", StringType())) or DDL string (for example, c1: string), determines the output table’s schema.

RocksDB state store enhancement

At Spark 3.2, RocksDB state store provider has been added as a built-in state store implementation.

Changelog checkpointing

A new checkpoint mechanism for the RocksDB state store provider called changelog checkpointing persists the changelog (updates) of the state. This reduces the commit latency, thereby reducing end-to-end latency significantly.

You can enable this by setting spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled to True.

You can also enable this feature with existing checkpoints.

Memory management enhancements

Although the RocksDB state store provider is well-known to be useful to address memory issues on the state, there was no fine-grained memory management. Spark 3.5 introduces more fine-grained memory management, which enables users to cap the total memory usage across RocksDB instances in the same executor process, enabling users to configure the memory usage per executor process.

Enhanced Structured Streaming

Spark 3.4 and 3.5 have many enhancements related to Spark Structured Streaming.

This new API deduplicates rows based on certain events. Watermark-based processing allows for more precise control over late data handling:

Deduplicate the same rows: dropDuplicatesWithinWatermark()
Deduplicate values on ‘value’ columns: dropDuplicatesWithinWatermark(['value'])
Deduplicate using the guid column with a watermark based on the eventTime column: withWatermark("eventTime", "10 hours") .dropDuplicatesWithinWatermark(["guid"])

Get started with AWS Glue 5.0

You can start using AWS Glue 5.0 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).

To start using AWS Glue 5.0 jobs in AWS Glue Studio, open the AWS Glue job and on the Job Details tab, choose the version Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.

To start using AWS Glue 5.0 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.0 in the %glue_version magic:

%%glue_version 5.0

The following output shows that the session is set to use AWS Glue 5.0:

Setting Glue version to: 5.0

Conclusion

In this post, we discussed the key features and benefits of AWS Glue 5.0. You can create new AWS Glue jobs on AWS Glue 5.0 to get the benefit from the improvements, or migrate your existing AWS Glue jobs.

We would like to thank the support of numerous engineers and leaders who helped build Glue 5.0 that enables customers with a performance optimized Spark runtime and several new capabilities.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Martin Ma is a Software Development Engineer on the AWS Glue team. He is passionate about improving the customer experience by applying problem-solving skills to invent new software solutions, as well as constantly searching for ways to simplify existing ones. In his spare time, he enjoys singing and playing the guitar.

Anshul Sharma is a Software Development Engineer in AWS Glue Team.

Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.

Maheedhar Reddy Chappidi is a Sr. Software Development Engineer on the AWS Glue team. He is passionate about building fault tolerant and reliable distributed systems at scale. Outside of his work, Maheedhar is passionate about listening to podcasts and playing with his two-year-old kid.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on generative AI applications for the Data Integration domain and distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features for the Data Integration and distributed system for data integration.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems to enable customers with simple-to-use interfaces and AI-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.

AWS Big Data Blog