Why Glue?
With AWS Glue, you pay an hourly rate, billed by the second, for crawlers (discovering data) and extract, transform, and load (ETL) jobs (processing and loading data). For the AWS Glue Data Catalog, you pay a simplified monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second. For AWS Glue DataBrew, the interactive sessions are billed per session, and DataBrew jobs are billed per minute. Usage of the AWS Glue Schema Registry is offered at no additional charge.
Note: Pricing can vary by AWS Region.
ETL jobs and interactive sessions
Pricing examples
ETL job: Consider an AWS Glue Apache Spark job that runs for 15 minutes and uses 6 DPUs. The price of 1 DPU-Hour is $0.44. Since your job ran for one-quarter of an hour and used 6 DPUs, AWS will bill you 6 DPUs * 1/4 hour * $0.44, or $0.66.
AWS Glue Studio Job Notebooks and Interactive Sessions: Suppose you use a notebook in AWS Glue Studio to interactively develop your ETL code. An interactive session has 5 DPUs by default. If you keep the session running for 24 minutes, or two-fifths of an hour, you will be billed for 5 DPUs * 2/5 hour at $0.44 per DPU-Hour, or $0.88.
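To make the arithmetic in these two examples explicit, here is a minimal Python sketch of the DPU-hour formula, assuming the $0.44 per DPU-Hour rate quoted above; the helper name is illustrative, not an official calculator.

```python
# Illustrative only: cost = DPUs * hours * rate, using the $0.44/DPU-hour rate quoted above.
DPU_HOUR_RATE = 0.44  # USD per DPU-hour

def dpu_cost(dpus: int, minutes: float, rate: float = DPU_HOUR_RATE) -> float:
    """Return the cost of a run that uses `dpus` DPUs for `minutes` minutes."""
    return dpus * (minutes / 60) * rate

print(round(dpu_cost(6, 15), 2))   # ETL job: 6 DPUs for 15 minutes -> 0.66
print(round(dpu_cost(5, 24), 2))   # Interactive session: 5 DPUs for 24 minutes -> 0.88
```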
Data Catalog
The AWS Glue Data Catalog is the centralized technical metadata repository for all your data assets across various data sources, including Amazon S3, Amazon Redshift, and third-party data sources. The Data Catalog can be accessed from Amazon SageMaker Lakehouse for data, analytics, and AI. It provides a unified interface to organize data as catalogs, databases, and tables, and to query them from Amazon Redshift, Amazon Athena, and Amazon EMR. AWS Lake Formation capabilities in the Data Catalog allow you to centralize data governance in AWS and govern data assets using fine-grained data permissions and familiar database-style features.
When using the Data Catalog, you are charged for storing and accessing table metadata and for running data processing jobs that compute table statistics and table optimizations.
Metadata pricing
With the Data Catalog, you can store up to a million metadata objects for free. If you store more than a million metadata objects, you will be charged $1.00 per 100,000 objects over a million, per month. A metadata object in the Data Catalog is a table, table version, partition, partition index, statistic, database, or catalog.
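As a rough sketch of the storage tier just described (first million objects free, then $1.00 per 100,000 objects per month), the following hypothetical calculation shows the monthly storage charge; rounding partial blocks of 100,000 objects up is an assumption made for illustration, not a documented billing rule.

```python
import math

FREE_OBJECTS = 1_000_000
RATE_PER_100K = 1.00  # USD per 100,000 objects above the free tier, per month

def monthly_metadata_storage_cost(object_count: int) -> float:
    """Monthly Data Catalog storage charge for `object_count` metadata objects.

    Assumes partial blocks of 100,000 objects are rounded up (an illustrative assumption).
    """
    billable = max(object_count - FREE_OBJECTS, 0)
    return math.ceil(billable / 100_000) * RATE_PER_100K

print(monthly_metadata_storage_cost(1_000_000))  # 0.0 (covered by the free tier)
print(monthly_metadata_storage_cost(1_500_000))  # 5.0 (500,000 objects above the free tier)
```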
Table maintenance and statistics
The Data Catalog provides managed compaction for Apache Iceberg tables in Amazon S3 object storage, compacting small objects into larger ones for better read performance by AWS analytics services like Amazon Redshift, Athena, Amazon EMR, and AWS Glue ETL jobs. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used for table compaction. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 1-minute minimum duration per run.
The Data Catalog also supports column-level table statistics for AWS Glue tables. These statistics are integrated with the cost-based optimizer (CBO) in Amazon Athena and Amazon Redshift data lake querying, resulting in improved query performance and potential cost savings.
Optimization
- $0.44 per Data Processing Unit-Hour for optimizing Apache Iceberg tables, billed per second with a 1-minute minimum.
Statistics
- $0.44 per Data Processing Unit-Hour for generating statistics, billed per second with a 1-minute minimum.
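The per-second billing with a 1-minute minimum for these runs can be sketched as follows; the round-up-to-the-nearest-second behavior and the 60-second floor follow the description above, the rate is the $0.44 per DPU-hour quoted here, and the function name is illustrative.

```python
import math

DPU_HOUR_RATE = 0.44  # USD per DPU-hour for table optimization and statistics

def optimization_run_cost(dpus: int, duration_seconds: float, rate: float = DPU_HOUR_RATE) -> float:
    """Cost of a compaction or statistics run, billed per second with a 1-minute minimum."""
    billed_seconds = max(math.ceil(duration_seconds), 60)  # round up, 60-second floor
    return dpus * (billed_seconds / 3600) * rate

print(round(optimization_run_cost(2, 30 * 60), 2))  # 30-minute compaction with 2 DPUs -> 0.44
print(round(optimization_run_cost(1, 10 * 60), 2))  # 10-minute statistics run with 1 DPU -> 0.07
print(round(optimization_run_cost(2, 20), 4))       # a 20-second run is billed as 60 seconds
```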
Additional usage and costs
Storage
Using the Data Catalog, you can create and manage tables in Amazon S3 and Amazon Redshift, and you are charged standard Amazon S3 or Amazon Redshift rates respectively for table storage. There are no additional storage charges in the Data Catalog.
1. When storing data in Amazon S3, you are charged standard Amazon S3 rates for storage, requests, and data transfer. See Amazon S3 pricing for more information.
2. When storing data in Amazon Redshift, you are charged standard Amazon Redshift rates for storage. For details, visit Amazon Redshift pricing.
Compute
When you access Amazon Redshift tables from Amazon EMR, AWS Glue, Athena, or any open source or third-party Apache Iceberg–compatible engine, a service-managed Amazon Redshift Serverless workgroup is used for compute. The Amazon Redshift Serverless managed workgroup is used to filter table results, and you are charged for the compute resources you use based on standard Amazon Redshift Serverless rates. There are no separate charges when you query tables stored in Amazon Redshift directly from Amazon Redshift. Visit Amazon Redshift pricing to learn more.
Lake Formation permissions
Lake Formation integrates with the Data Catalog and provides database-, table-, column-, row-, and cell-level permissions using tag-based or name-based access controls and cross-account sharing. There are no separate charges when creating Lake Formation permissions or using Lake Formation permissions with integrated AWS services.
Pricing examples
Data Catalog on the AWS Free Tier: Let’s consider that you store a million metadata objects in the Data Catalog in a given month and make 1 million metadata requests to access these tables. You pay $0 because your usage will be covered under the Data Catalog Free Tier. You can store the first million metadata objects and make a million metadata requests per month for free.
Data Catalog standard tier: Now consider that your metadata storage usage remains the same at 1 million metadata objects per month, but your requests double to 2 million metadata requests per month. Let’s say you also use crawlers to find new tables, and they run for 30 minutes and consume 2 DPUs.
Your storage cost is still $0, as the storage for your first million metadata objects is free. Your first million requests are also free. You will be billed for 1 million requests above the Data Catalog Free Tier, which is $1.
Using the Data Catalog with other services:
For example, when you query tables in Amazon Redshift using Athena SQL in SageMaker Lakehouse, you will be charged for: storing tables in Amazon Redshift based on standard Amazon Redshift pricing; the metadata requests made to the Data Catalog based on standard Data Catalog request pricing; metadata storage for storing catalog, database, and table metadata in the Data Catalog; Amazon Redshift Serverless RPU-hours on a per-second basis (with a 60-second minimum charge) for filtering Amazon Redshift table results; and the number of bytes scanned by the Athena query, rounded up to the nearest megabyte with a 10 MB minimum per query, using standard Athena pricing.
In another scenario in which you query tables in Amazon Redshift using Amazon EMR Serverless, you will be charged for: storing tables in Amazon Redshift based on standard Amazon Redshift pricing; the metadata request made to the Data Catalog based on standard Data Catalog request pricing; metadata storage for storing catalog, database, and table metadata in the Data Catalog; Amazon Redshift Serverless RPU-hours on a per-second basis (with a 60-second minimum charge) for filtering Amazon Redshift table results; and the amount of vCPU, memory, and storage resources consumed by your workers in an Amazon EMR application.
In another scenario in which you query Apache Iceberg tables in Amazon S3 object storage using Amazon Redshift Serverless, you will be charged for: storing Apache Iceberg tables in Amazon S3 based on standard Amazon S3 pricing; the metadata request made to Data Catalog based on standard Data Catalog request pricing; metadata storage for storing catalog, database, and table metadata in the Data Catalog; and compute-hours (RPU hours) based on standard Amazon Redshift pricing.
Continuing the standard-tier example, AWS Glue crawlers are billed at $0.44 per DPU-Hour, so you will pay for 2 DPUs * 1/2 hour at $0.44 per DPU-Hour, which equals $0.44. If you generate statistics for an AWS Glue table, and the statistics run takes 10 minutes and consumes 1 DPU, you will be billed 1 DPU * 1/6 hour * $0.44/DPU-Hour, which equals $0.07.
If you compact Apache Iceberg tables stored in Amazon S3 object storage, and the compaction runs for 30 minutes and consumes 2 DPUs, you will be billed 2 DPUs * 1/2 hour * $0.44/DPU-Hour, which equals $0.44.
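Putting the standard-tier example together, a hypothetical monthly bill for the usage described above could be itemized as follows; the figures are taken directly from the calculations in this section, and the itemization is only a sketch.

```python
# Itemized monthly bill for the standard-tier example above (figures from this section).
line_items = {
    "Metadata storage (first million objects free)": 0.00,
    "Metadata requests (1 million above the free tier)": 1.00,
    "Crawler run (2 DPUs for 30 minutes at $0.44/DPU-hour)": 0.44,
    "Statistics run (1 DPU for 10 minutes at $0.44/DPU-hour)": 0.07,
    "Iceberg compaction (2 DPUs for 30 minutes at $0.44/DPU-hour)": 0.44,
}

for item, cost in line_items.items():
    print(f"{item}: ${cost:.2f}")
print(f"Total: ${sum(line_items.values()):.2f}")  # Total: $1.95
```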
Crawlers
AWS Glue crawlers are billed at $0.44 per DPU-Hour; see the crawler calculation in the Data Catalog pricing examples above.
DataBrew interactive sessions
Pricing examples
AWS Glue DataBrew: The price for each 30-minute interactive session is $1.00. If you start a session at 9:00 AM, immediately leave the console, and return from 9:20 AM to 9:30 AM, this will use 1 session for a total of $1.00.
If you start a session at 9:00 AM and interact with the DataBrew console until 9:50 AM, exit the DataBrew project space, and come back to make your final interaction at 10:15 AM, this will use 3 sessions and you will be billed $1.00 per session for a total of $3.00.
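The 30-minute session counting implied by these two examples can be modeled roughly as follows; this is a simplified sketch that assumes a new session begins at the first interaction after the previous session has expired, which matches the examples above but is not an official billing specification.

```python
from datetime import datetime, timedelta

SESSION_LENGTH = timedelta(minutes=30)
SESSION_PRICE = 1.00  # USD per 30-minute interactive session

def count_sessions(interaction_times: list[datetime]) -> int:
    """Count billable 30-minute sessions for a list of interaction timestamps."""
    sessions = 0
    session_end = None
    for t in sorted(interaction_times):
        if session_end is None or t >= session_end:
            sessions += 1                  # this interaction starts a new session
            session_end = t + SESSION_LENGTH
    return sessions

day = datetime(2024, 1, 1)
# Interact from 9:00 to 9:50 (every 10 minutes), then once more at 10:15 -> 3 sessions
interactions = [day.replace(hour=9, minute=m) for m in range(0, 51, 10)]
interactions.append(day.replace(hour=10, minute=15))
n = count_sessions(interactions)
print(n, "sessions ->", f"${n * SESSION_PRICE:.2f}")  # 3 sessions -> $3.00
```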
DataBrew jobs
Pricing examples
AWS Glue DataBrew: If a DataBrew job runs for 10 minutes and consumes 5 DataBrew nodes, the price will be $0.40. Because your job ran for 1/6th of an hour and consumed 5 nodes, you will be billed 5 nodes * 1/6 hour * $0.48 per node-hour for a total of $0.40.
Data Quality
AWS Glue Data Quality builds confidence in your data by helping you achieve high data quality. It automatically measures, monitors, and manages data quality in your data lakes and pipelines, making it easier to identify missing, stale, or bad data.
You can access data quality features from the Data Catalog and AWS Glue Studio and through AWS Glue APIs.
Pricing for managing data quality of datasets cataloged in the Data Catalog: You can choose a dataset from the Data Catalog and generate recommendations. This action will create a Recommendation Task for which you will provision data processing units (DPUs). After you get the recommendations, you can modify or add new rules and schedule them. These tasks are called Data Quality Tasks, for which you will provision DPUs. You will require a minimum of 2 DPUs with a 1-minute minimum billing duration.
Pricing for managing data quality of datasets processed on AWS Glue ETL: You can also add data quality checks to your ETL jobs to prevent bad data from entering your data lakes. These data quality rules will reside within your ETL jobs, resulting in increased runtime or increased DPU consumption. Alternatively, you can use Flexible execution for non-SLA-sensitive workloads.
Pricing for detecting anomalies in AWS Glue ETL:
Anomaly detection:
You will incur 1 DPU per statistic, in addition to your ETL job DPUs, for the time it takes to detect anomalies. On average, it takes between 10 and 20 seconds to detect anomalies for 1 statistic. Let's assume that you configured two rules (Rule 1: data volume must be greater than 1,000 records; Rule 2: column count must be greater than 10) and one analyzer (Analyzer 1: monitor completeness of a column). This configuration will generate three statistics: row count, column count, and completeness percentage of a column. You will be charged 3 additional DPUs for the time it takes to detect anomalies, with a 1-second minimum. See Example 4 for more details.
Retraining:
You may want to exclude anomalous job runs or statistics so that the anomaly detection algorithm accurately predicts subsequent anomalies. To do this, AWS Glue allows you to exclude or include statistics. You will incur 1 DPU to retrain the model for the time it takes to retrain. On average, retraining takes 10 to 20 seconds per statistic. See Example 5 for more details.
Statistics storage:
There is no charge to store the statistics that are gathered. There is a limit of 100,000 statistics per account, and they are stored for 2 years.
Additional charges:
AWS Glue processes data directly from Amazon Simple Storage Service (Amazon S3). There are no additional storage charges for reading your data with AWS Glue. You are charged standard Amazon S3 rates for storage, requests, and data transfer. Based on your configuration, temporary files, data quality results, and shuffle files are stored in an S3 bucket of your choice and are also billed at standard S3 rates.
If you use the Data Catalog, you are charged standard Data Catalog rates. For details, see the Data Catalog storage and requests pricing above.
Pricing examples
Example 1 – Get recommendations for a table in the Data Catalog
For example, consider a recommendation task with 5 DPUs that completes in 10 minutes. You will pay 5 DPUs * 1/6 hour * $0.44, which equals $0.37.
Example 2 – Evaluate data quality of a table in the Data Catalog
After you review the recommendations, you can edit them if necessary and then schedule the data quality task by provisioning DPUs. For example, consider a data quality evaluation task with 5 DPUs that completes in 20 minutes.
You will pay 5 DPUs * 1/3 hour * $0.44, which equals $0.73.
Example 3 – Evaluate data quality in an AWS Glue ETL job
You can also add these data quality checks to your AWS Glue ETL jobs to prevent bad data from entering your data lakes. You can do this by adding the Data Quality transform in AWS Glue Studio or by using AWS Glue APIs within the code that you author in AWS Glue Studio notebooks. Consider an AWS Glue job with data quality rules configured within the pipeline that runs for 20 minutes (1/3 hour) with 6 DPUs. You will be charged 6 DPUs * 1/3 hour * $0.44, which equals $0.88. Alternatively, you can use Flex, for which you will be charged 6 DPUs * 1/3 hour * $0.29, which equals $0.58.
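For Example 3, the difference between standard and Flex execution comes down to the DPU-hour rate; here is a minimal sketch assuming the $0.44 standard and $0.29 Flex rates quoted above, with the helper name chosen for illustration only.

```python
# Example 3 sketch: same job, different DPU-hour rate for standard vs. Flex execution.
STANDARD_RATE = 0.44  # USD per DPU-hour
FLEX_RATE = 0.29      # USD per DPU-hour (Flexible execution)

def job_cost(dpus: int, minutes: float, rate: float) -> float:
    """Cost of a job that uses `dpus` DPUs for `minutes` minutes at the given rate."""
    return dpus * (minutes / 60) * rate

print(round(job_cost(6, 20, STANDARD_RATE), 2))  # standard execution -> 0.88
print(round(job_cost(6, 20, FLEX_RATE), 2))      # Flex execution -> 0.58
```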
Example 4 – Evaluate data quality in an AWS Glue ETL job with Anomaly Detection
Consider an AWS Glue job that reads data from Amazon S3, transforms the data, and runs data quality checks before loading it to Amazon Redshift. Assume that this pipeline has 10 rules and 10 analyzers, resulting in 20 statistics gathered. Also assume that the extraction, transformation, loading, statistics gathering, and data quality evaluation take 20 minutes. Without Anomaly Detection enabled, the customer is charged 6 DPUs * 1/3 hour (20 minutes) * $0.44, which equals $0.88 (A). With Anomaly Detection turned on, 1 DPU is added for every statistic, and it takes 15 seconds on average to detect anomalies. In this example, the customer incurs 20 statistics * 1 DPU * 15/3600 hours per statistic * $0.44 per DPU-Hour = $0.037 (B). The total cost of the job is $0.88 (A) + $0.037 (B) = $0.917.
Example 5 – Retraining
Consider that your AWS Glue job detected an anomaly. You decide to exclude the anomaly from the model so that the anomaly detection algorithm predicts future anomalies accurately. To do this, you can retrain the model by excluding this anomalous statistic. You will incur 1 DPU per statistic for the time it takes to retrain the model. On average, this can take 15 seconds. In this example, assuming you are excluding 1 data point, you will incur 1 statistic * 1 DPU * 15/3600 hours per statistic * $0.44 = $0.00185.
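Examples 4 and 5 both charge 1 DPU per statistic for the seconds it takes to detect anomalies or retrain. Below is a rough sketch of that add-on charge, using the 15-second average and $0.44 per DPU-hour figures from the examples; the function name is illustrative, and small differences from the dollar figures above come from rounding.

```python
DPU_HOUR_RATE = 0.44  # USD per DPU-hour

def per_statistic_cost(num_statistics: int, seconds_per_statistic: float = 15.0) -> float:
    """Add-on cost of anomaly detection or retraining: 1 DPU per statistic for the time taken."""
    return num_statistics * 1 * (seconds_per_statistic / 3600) * DPU_HOUR_RATE

base_job = 6 * (20 / 60) * DPU_HOUR_RATE   # Example 4 base ETL job: $0.88
anomaly_add_on = per_statistic_cost(20)    # 20 statistics -> roughly $0.037
retraining = per_statistic_cost(1)         # Example 5: 1 excluded statistic -> roughly $0.0018
print(round(base_job + anomaly_add_on, 3)) # ~0.917
print(round(retraining, 4))                # ~0.0018
```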
Zero-ETL
Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build extract, transform, and load (ETL) data pipelines for common ingestion and replication use cases in your analytics and AI initiatives. AWS does not charge an additional fee for the zero-ETL integration. You pay for source and target resources used to create and process the changed data created as part of a zero-ETL integration.
Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications
Amazon SageMaker Lakehouse and Amazon Redshift support zero-ETL integrations from applications, which automate the extraction and loading of data from applications into Amazon SageMaker Lakehouse and Amazon Redshift. See the AWS Glue zero-ETL documentation for the full list of supported zero-ETL sources.
AWS Glue charges a fee for the ingestion of application source data supported by zero-ETL integration. You pay for AWS Glue resources used to fetch inserts, updates, and deletes from your application. You are charged based on the volume of data received from the application, and are not charged for initiating the request to ingest data. Each ingestion request made by AWS Glue has a minimum volume of 1 megabyte (MB).
When the ingested data is written to Amazon Redshift, you pay for the resources used to process the changed data created as part of the zero-ETL integration based on Amazon Redshift pricing rates.
When the ingested data is written to SageMaker Lakehouse, you pay for the resources used to process the changed data created as part of the zero-ETL integration. The compute resource used is based on the storage type chosen for SageMaker Lakehouse.
- For Amazon Redshift managed storage, you are charged based on Amazon Redshift Serverless compute. For further information, see Amazon Redshift pricing.
- For Amazon Simple Storage Service (Amazon S3), you are charged based on AWS Glue compute per Data Processing Unit-Hour (DPU-Hour), billed per second with a 1-minute minimum.
Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse
Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse automates the extraction and loading of data, enabling analytics and AI for data from DynamoDB tables in the data lakehouse.
DynamoDB charges you a fee to export data from your DynamoDB continuous backups (point-in-time recovery). For further information, see Amazon DynamoDB pricing.
When the ingested data is written to Amazon SageMaker Lakehouse, you pay for the resources used to process the changed data created as part of the zero-ETL integration based on the storage type chosen for Amazon SageMaker Lakehouse.
- For Amazon Redshift managed storage, you are charged based on Amazon Redshift Serverless compute. For further information, see Amazon Redshift pricing.
- For Amazon Simple Storage Service (Amazon S3), you are charged based on AWS Glue compute per Data Processing Unit-Hour (DPU-Hour), billed per second with a 1-minute minimum.
Note: Pricing can vary by Region.
View the Global Regions table to learn more about AWS Glue availability.