AWS Big Data Blog
Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena
The AWS Glue Data Catalog now automates generating statistics for new tables. These statistics are integrated with the cost-based optimizer (CBO) in Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance and potential cost savings.
Queries on large datasets often read extensive amounts of data and perform complex join operations across multiple datasets. When a query engine like Redshift Spectrum or Athena processes the query, the CBO uses table statistics to optimize it. For example, if the CBO knows the number of distinct values in a table column, it can choose the optimal join order and strategy. These statistics must be collected beforehand and should be kept up to date to reflect the latest data state.
The Data Catalog has supported collecting table statistics used by the CBO in Redshift Spectrum and Athena for tables in Parquet, ORC, JSON, ION, CSV, and XML formats. We introduced this feature and its performance benefits in Enhance query performance using AWS Glue Data Catalog column-level statistics. The Data Catalog has also supported Apache Iceberg tables, which we covered in detail in Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog.
Previously, creating statistics for these tables in the Data Catalog required you to continuously monitor and update configurations for your tables. You had to do the following undifferentiated heavy lifting:
- Discover new tables in specific file formats (such as Parquet, JSON, CSV, XML, ORC, and ION) and in transactional table formats such as Iceberg, along with their individual bucket paths
- Determine and set up compute tasks based on scan strategy (sampling percentage and schedules)
- Configure AWS Identity and Access Management (IAM) and AWS Lake Formation roles for specific tasks to provide specific Amazon Simple Storage Service (Amazon S3) access, Amazon CloudWatch logs, AWS Key Management Service (AWS KMS) keys for CloudWatch encryption, and trust policies
- Set up event notification systems to understand changes in data lakes
- Set up specific optimizer configuration-based query performance and storage improvement strategies
- Set up a scheduler or build your own event-based compute tasks with setup and teardown
Now, the Data Catalog lets you generate statistics automatically for new and updated tables with a one-time catalog configuration. You can get started by selecting the default catalog on the Lake Formation console and enabling table statistics on the table optimization configuration tab. As new tables are created, the number of distinct values (NDV) is collected for Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length are collected for other file formats such as Parquet. Redshift Spectrum and Athena can use the updated statistics to optimize queries, applying optimizations such as optimal join ordering or cost-based aggregation pushdown. The AWS Glue console gives you visibility into the updated statistics and statistics generation runs.
Now, data lake administrators can configure weekly statistics collection across all databases and tables in their catalog. When the automation is enabled, the Data Catalog generates and updates column statistics for all columns in the tables on a weekly basis. This job analyzes 20% of records in the tables to calculate statistics. These statistics can be used by Redshift Spectrum and Athena CBO to optimize queries.
Furthermore, this new feature provides the flexibility to configure automation settings and scheduled collection configurations at the table level. Individual data owners can override catalog-level automation settings based on specific requirements. Data owners can customize settings for individual tables, including whether to enable automation, collection frequency, target columns, and sampling percentage. This flexibility allows administrators to maintain an optimized platform overall, while enabling data owners to fine-tune individual table statistics.
In this post, we discuss how the Data Catalog automates table statistics collection and how you can use it to enhance your data platform’s efficiency.
Enable catalog-level statistics collection
The data lake administrator can enable catalog-level statistics collection on the Lake Formation console. Complete the following steps:
- On the Lake Formation console, choose Catalogs in the navigation pane.
- Select the catalog that you want to configure, and choose Edit on the Actions menu.
- Select Enable automatic statistics generation for the tables of the catalog and choose an IAM role. For the required permissions, see Prerequisites for generating column statistics.
- Choose Submit.
You can also enable catalog-level statistics collection through the AWS Command Line Interface (AWS CLI):
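The following is a minimal sketch of that command; the catalog ID (your AWS account ID) and the role ARN are placeholders you would replace with your own values:

```
aws glue update-catalog \
  --catalog-id 123456789012 \
  --catalog-input '{
    "CatalogProperties": {
      "CustomProperties": {
        "ColumnStatistics.Enabled": "true",
        "ColumnStatistics.RoleArn": "arn:aws:iam::123456789012:role/GlueColumnStatisticsRole"
      }
    }
  }'
```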
The command calls the AWS Glue UpdateCatalog API, which takes in a CatalogProperties structure that expects the following key-value pairs for catalog-level statistics:
- ColumnStatistics.RoleArn – The IAM role Amazon Resource Name (ARN) to be used for all jobs triggered for catalog-level statistics
- ColumnStatistics.Enabled – A Boolean value indicating whether the catalog-level settings are enabled or disabled
Callers of UpdateCatalog must have UpdateCatalog IAM permissions and, if using Lake Formation permissions, be granted ALTER on CATALOG permissions on the root catalog. You can call the GetCatalog API to verify the properties set on your catalog. For the permissions required by the role you pass, see Prerequisites for generating column statistics.
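For example, a quick check with the AWS CLI, where the catalog ID is a placeholder account ID; the response should include the ColumnStatistics keys under the catalog's custom properties:

```
aws glue get-catalog --catalog-id 123456789012
```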
By following these steps, catalog-level statistics collection is enabled. AWS Glue then automatically updates statistics for all columns in each table, sampling 20% of records on a weekly basis. This allows data lake administrators to effectively manage the data platform’s performance and cost-efficiency.
View automated table-level settings
When catalog-level statistics collection is enabled and an Apache Hive or Apache Iceberg table is created or updated using the AWS Glue CreateTable or UpdateTable APIs (through the AWS Glue console, AWS SDKs, or AWS Glue crawlers), an equivalent table-level setting is created for that table.
Tables with automatic statistics generation enabled must use one of the following formats:
- Hive table formats such as Parquet, Avro, ORC, JSON, ION, CSV, and XML
- Apache Iceberg table format
After a table has been created or updated, you can confirm that a statistics collection setting has been applied by checking the table description on the AWS Glue console. The setting should show the Schedule property as Auto and Statistics configuration as Inherited from catalog. Statistics collection for tables with these settings is triggered automatically by AWS Glue internally.
The following screenshot shows a Hive table where catalog-level statistics collection has been applied and statistics have been collected:
The following screenshot shows an Iceberg table where catalog-level statistics collection has been applied and statistics have been collected:
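You can also retrieve collected statistics programmatically with the GetColumnStatisticsForTable API. The following sketch uses the example database and table from the walkthrough later in this post; the column names are illustrative:

```
aws glue get-column-statistics-for-table \
  --database-name optimization_test \
  --table-name catalog_returns \
  --column-names cr_item_sk cr_order_number
```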
Configure table-level statistics collection
Data owners can customize statistics collection at the table level to meet specific needs. For frequently updated tables, statistics can be refreshed more often than weekly. You can also specify target columns to focus on those most commonly queried.
Moreover, you can set what percentage of table records to sample when calculating statistics. You can increase this percentage for tables that need more precise statistics, or decrease it for tables where a smaller sample is sufficient, to optimize cost and statistics generation performance.
These table-level settings can override the catalog-level settings previously described.
To configure table-level statistics collection on the AWS Glue console, complete the following steps:
- On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
- Choose a database to view all available tables (for example, optimization_test).
- Choose the table to be configured (for example, catalog_returns).
- Go to Column statistics and choose Generate on schedule.
- In the Schedule section, choose a frequency: Hourly, Daily, Weekly, Monthly, or Custom (cron expression). In this example, for Frequency, choose Daily.
- For Start time, enter 06:43 in UTC.
- For Column options, select All columns.
- For IAM role, choose an existing role, or create a new role. For the required permissions, see Prerequisites for generating column statistics.
- Under Advanced configuration, for Security configuration, optionally choose your security configuration to enable at-rest encryption on the logs pushed to CloudWatch.
- For Sample rows, enter 100 as the percentage of rows to sample.
- Choose Generate statistics.
In the table description on the AWS Glue console, you can confirm that a statistics collection job has been scheduled for the specified date and time.
By following these steps, you have configured table-level statistics collection. This allows data owners to manage table statistics based on their specific requirements. Combined with catalog-level settings managed by data lake administrators, this secures a baseline for optimizing the entire data platform while flexibly addressing individual table requirements.
You can also create a column statistics generation schedule through the AWS CLI:
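The following sketch reproduces the console walkthrough above (Daily at 06:43 UTC, sampling 100% of rows, all columns); the role name is a placeholder:

```
aws glue create-column-statistics-task-settings \
  --database-name optimization_test \
  --table-name catalog_returns \
  --role GlueColumnStatisticsRole \
  --schedule "cron(43 6 * * ? *)" \
  --sample-size 100
```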
The required parameters are database-name, table-name, and role. You can also include optional parameters such as schedule, column-name-list, catalog-id, sample-size, and security-configuration. For more information, see Generating column statistics on a schedule.
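To verify the schedule from the CLI, you can call the GetColumnStatisticsTaskSettings API; the following is a sketch using the same example table:

```
aws glue get-column-statistics-task-settings \
  --database-name optimization_test \
  --table-name catalog_returns
```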
Conclusion
This post introduced a new feature in the Data Catalog that enables automated statistics collection at the catalog level with flexible per-table controls. Organizations can effectively manage and maintain up-to-date column-level statistics. By incorporating these statistics, CBO in both Redshift Spectrum and Athena can optimize query processing and cost-efficiency.
Try out this feature for your own use case, and let us know your feedback in the comments.
About the Authors
Sotaro Hikita is an Analytics Solutions Architect. He supports customers across a wide range of industries in building and operating analytics platforms more effectively. He is particularly passionate about big data technologies and open source software.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan, and is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
Kyle Duong is a Senior Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems.
Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.