Why Glue?
AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development. With generative AI assistance, AWS Glue provides all the capabilities needed for data integration, so you can gain insights and put your data to use in minutes instead of months. With AWS Glue, there is no infrastructure to set up or manage. You pay only for the resources consumed while your jobs are running.
Discover
Discover and search across all your AWS datasets
The AWS Glue Data Catalog is your persistent metadata store for all your data assets, regardless of where they are located. The Data Catalog contains table definitions, job definitions, schemas, and other control information to help you manage your AWS Glue environment. It automatically computes statistics and registers partitions to make queries against your data efficient and cost-effective. It also maintains a comprehensive schema version history so you can understand how your data has changed over time.
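As a minimal boto3 sketch, reading a table definition back from the Data Catalog might look like this (the database and table names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Fetch the catalog entry for a table and print its column schema.
response = glue.get_table(DatabaseName="sales", Name="orders")
for column in response["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```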
Automatic schema discovery
AWS Glue crawlers connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your AWS Glue Data Catalog. The metadata is stored in tables in your Data Catalog and used in the authoring process of your extract, transform, and load (ETL) jobs. You can run crawlers on a schedule, on demand, or in response to an event to keep your metadata up to date.
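For example, crawling an S3 prefix on a nightly schedule might look like the following boto3 sketch (the bucket, role, and crawler names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix nightly and registers the
# inferred schema as tables in the "sales" database of the Data Catalog.
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="sales",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/orders/"}]},
    Schedule="cron(0 2 * * ? *)",  # 02:00 UTC daily
)

# Crawlers can also be started on demand.
glue.start_crawler(Name="orders-crawler")
```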
Manage and enforce schemas for data streams
AWS Glue Schema Registry, a serverless feature of AWS Glue, helps you validate and control the evolution of streaming data using registered Apache Avro schemas at no additional charge. Through Apache-licensed serializers and deserializers, AWS Glue Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda. When data streaming applications are integrated with AWS Glue Schema Registry, you can improve data quality and safeguard against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update AWS Glue tables and partitions using schemas stored within the registry.
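For instance, registering an Avro schema with a backward-compatibility rule might look like this boto3 sketch (the registry, schema, and field names are hypothetical):

```python
import json

import boto3

glue = boto3.client("glue")

# An illustrative Avro record schema for order events.
avro_schema = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

glue.create_registry(RegistryName="orders-registry")
glue.create_schema(
    SchemaName="order-events",
    RegistryId={"RegistryName": "orders-registry"},
    DataFormat="AVRO",
    Compatibility="BACKWARD",  # rejects incompatible new schema versions
    SchemaDefinition=json.dumps(avro_schema),
)
```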
Automatically scale based on workload
Auto Scaling, a serverless feature in AWS Glue, dynamically scales resources up and down based on workload. With Auto Scaling, your job is assigned workers only when needed. As the job progresses through more demanding transforms, AWS Glue adds and removes resources depending on how far the workload can be split up. You no longer need to worry about over-provisioning resources, spending time optimizing the number of workers, or paying for idle resources.
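Auto Scaling is enabled per job. As a sketch, creating a job with a worker cap that AWS Glue scales within might look like this (the script location, role, and names are hypothetical; the job parameter spelling follows the AWS Glue documentation):

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role
    Command={"Name": "glueetl", "ScriptLocation": "s3://example-bucket/scripts/orders.py"},
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=20,  # upper bound; Auto Scaling stays at or below this
    DefaultArguments={"--enable-auto-scaling": "true"},
)
```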
Prepare
Deduplicate and cleanse data with built-in machine learning (ML)
AWS Glue helps clean and prepare your data for analysis without you having to become an ML expert. Its FindMatches feature deduplicates and finds records that are imperfect matches of each other. For example, use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main.” FindMatches will ask you to label sets of records as either “matching” or “not matching.” The system will then learn your criteria for calling a pair of records a “match” and will build an ETL job that you can use to find duplicate records within a database or matching records across two databases.
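As a sketch, the FindMatches transform for that restaurant example might be created like this with boto3 (the table, key column, and role are hypothetical, and the transform still needs labeled examples before it can be trained):

```python
import boto3

glue = boto3.client("glue")

glue.create_ml_transform(
    Name="dedupe-restaurants",
    Role="arn:aws:iam::123456789012:role/GlueMLRole",  # hypothetical role
    InputRecordTables=[{"DatabaseName": "sales", "TableName": "restaurants"}],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "restaurant_id",
            "PrecisionRecallTradeoff": 0.9,  # favor precision over recall
        },
    },
    GlueVersion="2.0",
)
```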
Edit, debug, and test ETL code with Interactive Sessions
If you choose to develop your ETL code interactively, AWS Glue Interactive Sessions, a serverless feature of job development, let you edit, debug, and test the code AWS Glue generates for you using your favorite integrated development environment (IDE) or notebook. You can write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries, and you can use and share code with other developers in our GitHub repository.
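In a notebook attached to an interactive session, cell magics configure the serverless session before any code runs. A sketch (the magic names follow the interactive sessions documentation; the values and path are hypothetical):

```python
# First cell: configure the session with AWS Glue interactive session magics.
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%idle_timeout 30

# Later cells run on the remote serverless Spark session.
df = spark.read.json("s3://example-bucket/raw/orders/")
df.printSchema()
```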
Normalize data without code using a visual interface
AWS Glue DataBrew provides an interactive, point-and-click visual interface for users like data analysts and data scientists to clean and normalize data without writing code. You can easily visualize, clean, and normalize data directly from your data lake, data warehouses, and databases, including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Aurora, and Amazon Relational Database Service (Amazon RDS). You can choose from over 250 built-in transformations to combine, pivot, and transpose the data, and automate data preparation tasks by applying saved transformations directly to the new incoming data.
Define, detect, and remediate sensitive data
AWS Glue Sensitive Data Detection helps you define, identify, and process sensitive data in your data pipeline and data lake. Once identified, you can remediate sensitive data by redacting, replacing, or reporting on personally identifiable information (PII) data and other types of data deemed sensitive. AWS Glue Sensitive Data Detection simplifies the identification and masking of sensitive data, including PII such as name, Social Security number, address, email, and driver’s license.
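The sketch below is a deliberately simplified stand-in for the built-in detection transform: plain PySpark that redacts email addresses with a regular expression, just to illustrate the redaction step (the built-in transform detects many more entity types than this):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("jane@example.com", "order-1")], ["email", "order_id"]
)

# Replace anything that looks like an email address with a fixed token.
redacted = df.withColumn(
    "email", regexp_replace("email", r"[^@\s]+@[^@\s]+\.[^@\s]+", "[REDACTED]")
)
redacted.show()
```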
Scale existing Python code with Ray
Developers like Python for its ease of use and rich collection of built-in data processing libraries, and they want to use familiar Python primitives when processing large datasets. AWS Glue for Ray helps data engineers process large datasets using Python and popular Python libraries. It uses Ray (ray.io), an open-source unified compute framework that scales Python workloads from a single node to hundreds of nodes. AWS Glue for Ray is serverless, so there is no infrastructure to manage.
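A minimal, self-contained sketch of the Ray programming model (in an AWS Glue for Ray job, the cluster behind ray.init() is provisioned for you):

```python
import ray

ray.init()  # in AWS Glue for Ray, the runtime provides the cluster

@ray.remote
def normalize(chunk):
    # Scale each value; stands in for real per-partition work.
    return [value / 100.0 for value in chunk]

chunks = [[10, 20, 30], [40, 50, 60]]
results = ray.get([normalize.remote(c) for c in chunks])
print(results)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```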
Create custom visual transforms
AWS Glue helps you create custom visual transforms so you can define, reuse, and share ETL logic. With AWS Glue Custom Visual Transforms, data engineers can write and share business-specific Apache Spark logic, reducing dependence on Spark developers and making it simpler to keep ETL jobs up to date. These transforms are available to all jobs in your AWS account, whether visual or code-based.
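A custom visual transform pairs a JSON descriptor with a Python function over a DynamicFrame. A sketch of the Python half, following the pattern in the AWS Glue documentation (the function, column, and threshold names are hypothetical):

```python
from awsglue import DynamicFrame

def filter_large_orders(self, threshold):
    # "self" is the incoming DynamicFrame; keep rows above the threshold.
    df = self.toDF().filter(f"amount > {threshold}")
    return DynamicFrame.fromDF(df, self.glue_ctx, self.name)

# Register the transform so visual and code-based jobs can call it.
DynamicFrame.filter_large_orders = filter_large_orders
```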
Modernize Apache Spark jobs with GenAI upgrades (preview)
AWS Glue provides generative AI capabilities to automatically analyze your Spark jobs and generate upgrade plans to newer versions. This reduces the time and effort needed to keep your Spark jobs modern, secure, and performant by automating the identification and updating of scripts and configurations.
Accelerate debugging with GenAI troubleshooting (preview)
AWS Glue uses generative AI to quickly identify and resolve issues in Spark jobs. It analyzes job metadata, execution logs, and configurations to provide root cause analysis and actionable recommendations, reducing troubleshooting time from days to minutes.
Integrate
Simplify data integration job development
AWS Glue Interactive Sessions, a serverless feature of job development, simplifies the development of data integration jobs. With AWS Glue Interactive Sessions, data engineers can interactively explore, experiment on, prepare, and process data using the IDE or notebook of their choice.
Built-in job notebooks
AWS Glue Studio Job Notebooks provide serverless notebooks with minimal setup in AWS Glue Studio, so developers can get started quickly. With AWS Glue Studio Job Notebooks, you have access to a built-in interface for AWS Glue Interactive Sessions where you can save and schedule your notebook code as AWS Glue jobs.
Build complex ETL pipelines with simple job scheduling
AWS Glue jobs can be invoked on a schedule, on demand, or based on an event. You can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. AWS Glue will handle all inter-job dependencies, filter bad data, and retry jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch, so you can monitor and get alerts from a central service. Amazon Managed Workflows for Apache Airflow (MWAA) is a managed service for Apache Airflow that lets you use your current, familiar Apache Airflow platform to orchestrate your workflows. Using MWAA, you can orchestrate multiple ETL processes that use diverse technologies within a complex ETL workflow.
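As a sketch, a nightly schedule plus a conditional trigger that chains a second job after the first succeeds (the job and trigger names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Run the transform job at 03:00 UTC every day.
glue.create_trigger(
    Name="nightly-transform",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "transform-orders"}],
    StartOnCreation=True,
)

# Start the load job only after the transform job succeeds.
glue.create_trigger(
    Name="load-after-transform",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "transform-orders",
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "load-orders"}],
    StartOnCreation=True,
)
```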
Apply and deploy DevOps best practices with Git integration
AWS Glue integrates with Git, the widely used open-source version-control system. You can use GitHub and AWS CodeCommit to maintain a history of changes to your AWS Glue jobs and apply existing DevOps practices to deploy them. Git integration in AWS Glue works for all AWS Glue job types, whether visual or code-based. It includes built-in integration with both GitHub and CodeCommit and also makes it simpler to use automation tools like Jenkins and AWS CodeDeploy to deploy AWS Glue jobs.
Reduce costs for nonurgent workloads with flexible job execution
AWS Glue Flex is a flexible execution job class that allows you to reduce the cost of your nonurgent data integration workloads (for example, preproduction jobs, testing, and data loads) by up to 35%. AWS Glue has two job execution classes: standard and flexible. The standard execution class is ideal for time-sensitive workloads that require fast job startup and dedicated resources. AWS Glue Flex is appropriate for non-time-sensitive jobs whose start and completion times may vary.
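The execution class is chosen per job or per run. A boto3 sketch of running an existing job on Flex (the job name is hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Run a nonurgent job on spare capacity instead of dedicated resources.
glue.start_job_run(
    JobName="preprod-backfill",
    ExecutionClass="FLEX",
)
```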
Read, insert, update, and delete files in your data lake
AWS Glue natively supports three open-source data lake frameworks: Apache Hudi, Apache Iceberg, and Linux Foundation Delta Lake. These frameworks help you manage data in a transactionally consistent manner for use in your Amazon S3-based data lake.
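For example, an upsert into an Iceberg table from an AWS Glue Spark job might look like the following sketch. It assumes the job runs with the iceberg value of the --datalake-formats job parameter and a glue_catalog Spark catalog configured per the AWS Glue documentation; the table, view, and column names are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# updates_df stands in for incoming change records keyed by order_id.
updates_df = spark.createDataFrame([("o-1", 19.99)], ["order_id", "amount"])
updates_df.createOrReplaceTempView("updates")

# Insert new orders and update existing ones in a single transaction.
spark.sql("""
    MERGE INTO glue_catalog.sales.orders AS target
    USING updates AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```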
Deliver high-quality data across your data lakes and pipelines
AWS Glue Data Quality helps you improve your data quality and your confidence in it. It automatically measures, monitors, and manages data quality in your data lakes and pipelines. It also computes statistics, recommends quality rules, and alerts you when quality deteriorates, making it easier to identify missing, stale, or bad data before it impacts your business.
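Rules are written in Data Quality Definition Language (DQDL). A sketch of evaluating a small ruleset inside an AWS Glue job, following the pattern in the AWS Glue Data Quality documentation (the frame, rules, and context name are hypothetical):

```python
from awsgluedq.transforms import EvaluateDataQuality

# A small DQDL ruleset: completeness, a value range, and a volume check.
ruleset = """Rules = [
    IsComplete "order_id",
    ColumnValues "amount" >= 0,
    RowCount > 1000
]"""

results = EvaluateDataQuality.apply(
    frame=orders_frame,  # a DynamicFrame produced earlier in the job
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "orders_check"},
)
results.toDF().show()
```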
Enforce fine-grained access control on your data lake
AWS Glue 5.0 and later simplifies security and governance over transactional data lakes by providing table-, column-, and row-level access controls for Apache Spark jobs that access Apache Iceberg, Apache Hudi, and Delta Lake tables.
Transform
Visually transform data with a drag-and-drop interface
AWS Glue Studio allows you to author highly scalable ETL jobs for distributed processing without becoming an Apache Spark expert. Define your ETL process in the drag-and-drop job editor and AWS Glue automatically generates the code to extract, transform, and load your data. The code is generated in Scala or Python and written for Apache Spark.
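The generated scripts follow a recognizable skeleton. A minimal hand-written sketch in the same style (the database, table, mappings, and paths are hypothetical):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract from the Data Catalog, rename and retype columns, load to S3.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales", table_name="orders"
)
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amt", "double", "amount", "double"),
    ],
)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/clean/orders/"},
    format="parquet",
)
job.commit()
```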
Generate ETL code with Amazon Q Data Integration
Create ETL jobs using natural language with Amazon Q Data Integration in AWS Glue. Simply describe your data transformation needs, and get automatically generated Apache Spark code that you can customize, test, and deploy as production jobs.
Clean and transform streaming data in-flight
Serverless streaming ETL jobs in AWS Glue continuously consume data from streaming sources including Amazon Kinesis and Amazon MSK, clean and transform it in-flight, and make it available for analysis in seconds in your target data store. Use this feature to process event data like IoT event streams, clickstreams, and network logs. AWS Glue streaming ETL jobs can enrich and aggregate data, join batch and streaming sources, and run a variety of complex analytics and machine learning operations.
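A sketch of the micro-batch pattern in an AWS Glue streaming job, reading a Kinesis-backed Data Catalog table and writing each window to S3 (the names, options, and filter logic are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# A streaming DataFrame backed by a Kinesis source table in the catalog.
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # Drop malformed events and append the cleaned micro-batch to S3.
    batch_df.filter("event_type IS NOT NULL") \
        .write.mode("append").parquet("s3://example-bucket/clean/clickstream/")

glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://example-bucket/checkpoints/clickstream/",
    },
)
```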
Optimize
Optimization of Apache Iceberg Tables
AWS Glue Data Catalog supports optimization of Apache Iceberg tables through the three optimizers described below; a combined setup sketch follows their descriptions.
Compaction
AWS Glue Data Catalog supports compaction, which combines small data files to reduce storage usage and improve read performance.
Snapshot retention
AWS Glue Data Catalog supports a snapshot retention optimizer that helps manage storage overhead by retaining only the snapshots you need and removing older, unnecessary snapshots and their associated underlying files.
Unreferenced file deletion
AWS Glue Data Catalog supports periodically identifying and removing unnecessary unreferenced files, freeing up storage.
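All three optimizers are enabled per table. A boto3 sketch covering compaction, snapshot retention, and unreferenced file deletion (the account ID, role, and table names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

for optimizer_type in ("compaction", "retention", "orphan_file_deletion"):
    glue.create_table_optimizer(
        CatalogId="123456789012",  # hypothetical account ID
        DatabaseName="sales",
        TableName="orders_iceberg",
        Type=optimizer_type,
        TableOptimizerConfiguration={
            "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",
            "enabled": True,
        },
    )
```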
Apache Iceberg Statistics
AWS Glue Data Catalog supports calculating and updating the number of distinct values (NDVs) for each column in Iceberg tables, resulting in better query optimization, data management, and performance for data engineers and scientists working with large-scale datasets.
Optimize query performance for AWS Glue Data Catalog tables
AWS Glue Data Catalog supports column-level statistics for data formats such as Parquet, ORC, JSON, ION, CSV, and XML. AWS analytics services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans and choose the optimal plan, improving query performance.
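Statistics generation is started per table. A boto3 sketch (the role and column list are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Compute column-level statistics, including NDVs, for selected columns.
glue.start_column_statistics_task_run(
    DatabaseName="sales",
    TableName="orders",
    Role="arn:aws:iam::123456789012:role/GlueStatsRole",
    ColumnNameList=["order_id", "amount"],
)
```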