AWS Glue Documentation

Data Discovery

Discover and search

The AWS Glue Data Catalog is designed to be a persistent metadata store for all your AWS data assets. The Data Catalog contains table definitions, job definitions, schemas, and other control information to help you manage your AWS Glue environment. It is designed to compute statistics and register partitions to help make queries against your data efficient and effective. It is also designed to maintain a schema version history so you can understand how your data has changed over time.

Schema discovery 

AWS Glue crawlers are designed to connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your AWS Glue Data Catalog. The metadata stored in tables in your Data Catalog can be used when authoring your ETL jobs. You can run crawlers on a schedule or on demand, or trigger them based on an event.
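
As a sketch of how a scheduled crawler might be defined with boto3 (the AWS SDK for Python) — all names below (crawler, role ARN, database, S3 path) are hypothetical:

```python
# Hypothetical crawler definition; the crawler name, role ARN, database,
# and S3 path are placeholders, not real resources.
crawler_config = {
    "Name": "sales-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # IAM role the crawler assumes
    "DatabaseName": "sales_db",                                # Data Catalog database to populate
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    "Schedule": "cron(0 2 * * ? *)",  # run daily at 02:00 UTC; omit for on-demand runs
}

# With credentials configured, the call would be:
# import boto3
# boto3.client("glue").create_crawler(**crawler_config)
```

Omitting `Schedule` leaves the crawler runnable on demand via `start_crawler`.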

Manage and enforce schemas for data streams 

AWS Glue Schema Registry, a serverless feature of AWS Glue, helps you validate and control the evolution of streaming data using registered Apache Avro schemas. Through Apache-licensed serializers and deserializers, the Schema Registry is designed to integrate with Java applications.  When you integrate data streaming applications with the Schema Registry, it can help improve data quality and assist in safeguarding against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update AWS Glue tables and partitions using schemas stored within the registry.
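
A minimal sketch of registering an Avro schema in the Schema Registry via boto3; the registry and schema names are hypothetical:

```python
import json

# Hypothetical Avro schema for an order event stream.
avro_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

create_schema_request = {
    "RegistryId": {"RegistryName": "orders-registry"},  # hypothetical registry
    "SchemaName": "order-events",
    "DataFormat": "AVRO",
    "Compatibility": "BACKWARD",  # reject new versions that would break existing readers
    "SchemaDefinition": json.dumps(avro_schema),
}
# boto3.client("glue").create_schema(**create_schema_request)
```

The `Compatibility` setting is what drives the evolution checks described above; `BACKWARD` is one of several supported modes.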

Scale based on workload 

AWS Glue Autoscaling, a serverless feature in AWS Glue, is designed to dynamically scale resources up and down based on workload. With Autoscaling, your job is assigned workers only when needed. As the job progresses through more demanding transforms, AWS Glue is designed to add and remove resources depending on how far it can split up the workload.

Data Preparation

Deduplicate and cleanse data  

AWS Glue can help clean and prepare your data for analysis. Its FindMatches feature is designed to deduplicate and find records that are imperfect matches of each other. FindMatches simply asks you to label sets of records as either “matching” or “not matching.” The system is designed to learn your criteria for calling a pair of records a “match” and build an ETL job that you can use to help you find duplicate records within a database or matching records across two databases.
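
A hedged sketch of how a FindMatches ML transform might be defined with boto3 — the table, primary key, and role ARN are all hypothetical:

```python
# Hypothetical FindMatches transform over a "customers" table; names are placeholders.
ml_transform_request = {
    "Name": "dedupe-customers",
    "Role": "arn:aws:iam::123456789012:role/GlueMLRole",
    "InputRecordTables": [{"DatabaseName": "crm_db", "TableName": "customers"}],
    "Parameters": {
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            "PrecisionRecallTradeoff": 0.9,  # favor precision: fewer false "match" labels
        },
    },
}
# boto3.client("glue").create_ml_transform(**ml_transform_request)
```

After creation, you would download a labeling file, mark record sets as matching or not, upload the labels, and train — the label/train loop described above.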

Edit, debug, and test ETL code with developer endpoints 

If you choose to interactively develop your ETL code, AWS Glue provides development endpoints for you to edit, debug, and test the code it generates for you. You can use your favorite IDE or notebook. You can write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries.

Normalize data

AWS Glue DataBrew provides an interactive, point-and-click visual interface to help users like data analysts and data scientists clean and normalize data. It is designed to allow you to visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Define, detect, and remediate sensitive data 

AWS Glue Sensitive Data Detection enables you to define, identify, and process sensitive data in your data pipeline and data lake. Once identified, you can remediate sensitive data by redacting, replacing, or reporting on personally identifiable information (PII) data and other types of data deemed sensitive.

Create custom visual transforms

AWS Glue helps you create custom visual transforms so you can define, reuse, and share ETL logic. Custom visual transforms are designed to allow data engineers to write and share business-specific Apache Spark logic, making it simpler to keep ETL jobs up to date.

Modernize Apache Spark jobs with GenAI upgrades

AWS Glue provides generative AI capabilities designed to help analyze your Spark jobs and generate upgrade plans to newer versions.

Accelerate debugging with GenAI troubleshooting 

AWS Glue uses generative AI to help identify and resolve issues in Spark jobs. It is designed to analyze job metadata, execution logs, and configurations to provide root cause analysis and actionable recommendations.

Integrate

Data Integration job development 

AWS Glue Interactive Sessions, a serverless feature of AWS Glue, is designed to assist with the development of data integration jobs. It enables data engineers to interactively explore, experiment on, and prepare data using a supported IDE or notebook.

Job Notebooks 

AWS Glue Studio Job Notebooks provide serverless notebooks in AWS Glue Studio, with a built-in interface for AWS Glue Interactive Sessions that enables users to save and schedule their notebook code as AWS Glue jobs.

Build ETL pipelines

AWS Glue jobs are designed to be invoked on a schedule, on demand, or based on an event. You can start multiple jobs in parallel or specify dependencies across jobs to build ETL pipelines. AWS Glue is designed to handle inter-job dependencies, filter bad data, and retry jobs if they fail. Logs and notifications are designed to be pushed to Amazon CloudWatch, so you can monitor and get alerts from a central service.
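
A sketch of expressing these dependencies with Glue triggers via boto3; the job and trigger names are hypothetical:

```python
# A scheduled trigger starts the (hypothetical) extract job nightly.
scheduled_trigger = {
    "Name": "nightly-extract",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 3 * * ? *)",  # start at 03:00 UTC
    "Actions": [{"JobName": "extract-orders"}],
    "StartOnCreation": True,
}

# A conditional trigger runs the transform job only after the extract job succeeds,
# encoding the inter-job dependency described above.
conditional_trigger = {
    "Name": "transform-after-extract",
    "Type": "CONDITIONAL",
    "Predicate": {
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "extract-orders", "State": "SUCCEEDED"}
        ]
    },
    "Actions": [{"JobName": "transform-orders"}],
}
# glue = boto3.client("glue")
# glue.create_trigger(**scheduled_trigger)
# glue.create_trigger(**conditional_trigger)
```

For larger pipelines, AWS Glue workflows can group such triggers and jobs into a single orchestrated unit.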

Nonurgent workloads

AWS Glue Flex is a flexible execution job class that is designed to run your nonurgent data integration workloads. AWS Glue has two job execution classes: standard and flexible. The standard execution class is designed for time-sensitive workloads that require fast job startup and dedicated resources. AWS Glue Flex is designed for non-time-sensitive jobs whose start and completion times may vary.
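
Choosing the execution class is a per-run (or per-job) setting; a minimal boto3 sketch with a hypothetical job name:

```python
# Start a (hypothetical) non-urgent job on the flexible execution class.
start_run_request = {
    "JobName": "nightly-backfill",
    "ExecutionClass": "FLEX",  # vs. "STANDARD"; start and completion times may vary
}
# boto3.client("glue").start_job_run(**start_run_request)
```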

Open-source frameworks

AWS Glue natively supports open-source frameworks that are designed to help you manage data in a transactionally consistent manner.

Deliver data across your data lakes and pipelines

AWS Glue Data Quality is designed to measure, monitor, and manage data quality in your data lakes and pipelines. It is also designed to compute statistics, recommend quality rules, monitor data, and alert you when quality deteriorates.
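
Quality rules are written in DQDL (Data Quality Definition Language). A sketch with a hypothetical table and columns:

```python
# Hypothetical DQDL ruleset for an "orders" table.
ruleset = """Rules = [
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "amount" > 0
]"""

create_ruleset_request = {
    "Name": "orders-quality-checks",
    "Ruleset": ruleset,
    "TargetTable": {"DatabaseName": "sales_db", "TableName": "orders"},
}
# boto3.client("glue").create_data_quality_ruleset(**create_ruleset_request)
```

Evaluating the ruleset against the table then yields pass/fail results per rule, which can feed the alerting described above.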

Enforce fine-grained access control on your data lake

AWS Glue 5.0 and later helps with security and governance over transactional data lakes by providing access controls at the table, column, and row level.

Integrate data from multiple sources without operational overhead

AWS Glue provides zero-ETL integration that connects multiple data sources to your analytics environment. Zero-ETL integration is designed to reduce the effort of building, operating, and maintaining traditional data pipelines. Changes made at the source are designed to be replicated to help keep your data current.

Transform

Visually transform data with a drag-and-drop interface 

AWS Glue Studio helps you to author scalable ETL jobs for distributed processing. You can define your ETL process in the drag-and-drop job editor and AWS Glue is designed to generate the code to extract, transform, and load your data.

Clean and transform streaming data  

Serverless streaming ETL jobs in AWS Glue are designed to consume data from streaming sources, clean and transform it, and make it available for analysis in your target data store. AWS Glue streaming ETL jobs can help you enrich and aggregate data, join batch and streaming sources, and run a variety of analytics and machine learning operations.

Optimize

Compaction

AWS Glue Data Catalog is designed to support three compaction strategies: binpack, sort, and z-order. The binpack strategy is designed to optimize file size, sort compaction to assist query execution by reducing file scans, and z-order to enable multi-dimensional file pruning.

Snapshot retention

AWS Glue Data Catalog is designed to support a snapshot retention optimizer that can help manage storage overhead by retaining only the snapshots that are needed and removing older, unnecessary snapshots and their associated underlying files.

Unreferenced file deletion

AWS Glue Data Catalog is designed to support periodically identifying and removing unreferenced files, freeing up storage.
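
The three optimizers above (compaction, snapshot retention, unreferenced file deletion) can each be enabled per table. A hedged boto3 sketch — catalog ID, database, table, and role ARN are hypothetical, and the exact configuration keys should be checked against the current API:

```python
# Hypothetical table optimizer setup; all identifiers are placeholders.
common = {
    "CatalogId": "123456789012",
    "DatabaseName": "lake_db",
    "TableName": "orders_iceberg",
}
configuration = {
    "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
    "enabled": True,
}

# One request per optimizer type.
optimizer_requests = [
    {**common, "Type": t, "TableOptimizerConfiguration": configuration}
    for t in ("compaction", "retention", "orphan_file_deletion")
]
# glue = boto3.client("glue")
# for req in optimizer_requests:
#     glue.create_table_optimizer(**req)
```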

Apache Iceberg Statistics

AWS Glue Data Catalog is designed to support calculating and updating the number of distinct values (NDVs) for each column in Iceberg tables, helping with query optimization, data management, and performance efficiency for data engineers and scientists working with large-scale datasets.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.