AWS Storage Blog
Analyzing Amazon S3 Metadata with Amazon Athena and Amazon QuickSight
Object storage provides virtually unlimited scalability, but managing billions, or even trillions, of objects can pose significant challenges. How do you know what data you have? How can you find the right datasets at the right time?
By implementing a robust metadata management strategy, you can answer these questions, gain better control over massive data lakes, improve data discoverability, accessibility, and extract more value from your data. AWS introduced Amazon S3 Metadata (Preview) at re:Invent 2024, which delivers queryable object metadata in near real-time to organize your data and accelerate data discovery. S3 Metadata helps you to curate, identify, and use your S3 data for business analytics, real-time inference applications, and more.
In this post, we walk through how to set up and use S3 Metadata, how to derive actionable insights using simple SQL queries from Amazon Athena, and then how to visualize your data from Amazon QuickSight. We show you some example queries and dashboards that you can create to insights into and visualizations of your S3 Metadata. You can use these queries or create your own to understand what data you have and to find the right data sets when you need them.
Amazon S3 Metadata (Preview)
S3 Metadata delivers queryable object metadata in near real-time to organize your data and accelerate data discovery. S3 Metadata includes new fields which aren’t available today from other sources such as S3 Inventory reports or ListObjectsV2 API. For a detailed comparison, refer to the following table.
Metadata | Available in S3 Inventory | Available in ListObjectsV2 | Available in S3 Metadata |
bucket | Y | Y | Y |
key | Y | Y | Y |
sequence_number | N | N | Y |
record_type | N | N | Y |
record_timestamp | N | N | Y |
version_id | Y | Y | Y |
is_delete_marker | Y | N | Y |
size | Y | Y | Y |
last_modified_date | Y | Y | Y |
e_tag | Y | Y | Y |
storage_class | Y | Y | Y |
is_multipart | Y | N | Y |
encryption_status | Y | Y | Y |
is_bucket_key_enabled | Y | N | Y |
kms_key_arn | N | N | Y |
checksum_algorithm | Y | Y | Y |
object_tags | N | N | Y |
user_metadata | N | N | Y |
requester | N | N | Y |
source_ip_address | N | N | Y |
request_id | N | N | Y |
object_lock_retain_until_date | Y | N | N |
object_lock_mode | Y | N | N |
object_lock_legal_hold_status | Y | N | N |
intelligent_tiering_access_tier | Y | N | N |
bucket_key_status | Y | N | N |
object_access_control_list | Y | N | N |
object_owner | N | Y | N |
replication_status | Y | N | N |
restore_status | N | Y | N |
The metadata table contains mutation events including object creation, metadata update such as tag update, and object deletion. In the metadata table, each row is a change event and each column is a metadata category. S3 Metadata offers at least one delivery of each event, so there may be duplicate records.
Prerequisites
There are a few prerequisites before you get started:
1. Have or create an IAM role to use with S3, Glue, Lake Formation, Athena, and QuickDight. For more information, see IAM roles.
2. Ensure that you have access to Athena from your account.
3. Have or create an S3 general purpose bucket that has traffic. Since S3 Metadata collects mutations, you will need an S3 bucket that has PUT and/or DELETE traffic.
Walkthrough
From a high level, we:
1. Enable S3 Metadata on your general purpose S3 bucket
2. Run SQL queries on Athena to gain insights on S3 Metadata
3. Configure QuickSight to visualize S3 Metadata
1. Enable S3 Metadata on your general purpose S3 bucket
In this section, we walk through the following:
1.1. Create an S3 table bucket to hold the S3 Metadata Iceberg table where the metadata will be stored and enable integration with Glue and Lake Formation.
1.2. Configure S3 Metadata to collect data on a general purpose S3 bucket.
1.1. Create a table bucket and enable integration with AWS analytics services
S3 Metadata uses S3 Tables to store the output table. S3 Tables are purpose-built for storing tabular data at scale. S3 Tables’ preview integration with AWS Glue Data Catalog allows you to query and visualize data using AWS Analytics services such as Athena and QuickSight. Let’s create a table bucket to store the metadata table.
To get started, navigate to S3 in the AWS Management Console. On the left navigation menu, select Table buckets, and then Create table bucket.
Enter a name for the table bucket, ensuring the Enable Integration checkbox is selected for integration with AWS Analytics services. Select the Create table bucket button to finalize the table bucket creation.
Once the S3 table bucket is created, you will see this screen.
1.2. Configure S3 Metadata
Once you’ve created the table bucket to store the metadata table, you can now navigate to a general purpose bucket to configure S3 Metadata to start producing a metadata table. This is the bucket that you are interested in having S3 Metadata automatically surface object metadata. Once you have navigated to the bucket in the S3 Console, click the Metadata tab. Then, click Create metadata configuration.
Select Browse S3, select the S3 table bucket created in the previous step, and click Choose.
Accept the default table name or enter another name for the metadata table as you prefer. Click Create metadata configuration.
2. Run SQL queries on Amazon Athena to gain insights on S3 Metadata
In this section, we walk through the following:
2.1. Configure Lake Formation permissions. Here we will be providing access to the console role that will have access to the S3 Metadata table.
2.2. Verify access to S3 Metadata in Athena. You will find the catalog and database in the S3 Metadata configuration page.
2.3. Explore your S3 Metadata with Athena. You will try some sample queries to better understand what kinds of insights are possible.
2.1. Configure Lake Formation permissions
Once you’ve set up the metadata table, you will need to grant permissions for users to access the table if you want them to query the table from AWS services like Athena and QuickSight. Permissions are granted using Lake Formation.
Go to Lake Formation in the AWS Management Console. If this is the first time you are using Lake Formation, you will see a Welcome window. Select Add myself and optionally add any other users or roles that should be data lake administrators. Then click Get started.
Once loaded, select Catalogs from the left menu, and then s3tablescatalog. This is the federated catalog that contains table buckets. For more information on the integration, refer to the S3 Tables documentation on integration with AWS Analytics services.
Select the radio button next to the S3 Table Bucket created in the earlier step, and select Actions, then Grant.
In the Principals section, select the IAM users and roles you want to grant access to.
Note: Lake Formation also integrates with IAM Identity Center SAML (Security Assertion Markup Language) users and groups, and external accounts. Configuration of these options will not be covered as part of this post.
In the LF-Tags or catalog resources section, we will grant access to the metadata table you just created. On this screen, Named Data Catalog resources should already be selected, as should the s3tablescatalog/<<your s3 table bucket>> as the catalog. If s3tablescatalog/<<your s3 table bucket>> is not selected, select it from the dropdown menu and deselect any other catalogs. From the Databases section, select aws_s3_metadata from the dropdown menu. From Tables, select the s3 metadata table you just created to grant access or select All Tables if you want to create more S3 Metadata tables in the same S3 table bucket.
In the Table permissions section, we will configure what permission the Principal will have on the table, here we will check Select and Describe in Table and Grantable permissions. For data permissions, keep the default of All data access and click Grant to finish.
2.2. Verify access to S3 Metadata in Athena
Once the Lake Formation permissions have been added, navigate to the Athena Console. If this is the first time you are using Athena, use this guide to get started with Athena. You can follow the guide up to and including step 7.
On the left side of the editor, select the Catalog and Database for your S3 Metadata configuration.
You can find the values for Catalog and Database on the S3 Metadata configuration page for your bucket.
Once you have set up Athena, you can run the following query to verify that everything is working correctly.
SELECT * FROM <<table name>> LIMIT 10;
2.3. Explore your S3 Metadata with Athena
Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. Here is a series of sample queries you can use to analyze your S3 Metadata from Athena. In the examples below, we have used s3_metadata_primary as the table name. Replace that with your own S3 Metadata table name.
Find objects that match a pattern.
SELECT key FROM s3_metadata_primary where key LIKE '%.jpg'
How up-to-date is the table at this time (timestamp in GMT)?
SELECT MAX(last_modified_date) FROM s3_metadata_primary;
What objects were deleted in the last 24 hours?
SELECT bucket, key, version_id, last_modified_date from s3_metadata_primary where last_modified_date >= (current_date - interval '1' day) and record_type = ‘DELETE’;
How many objects has S3 Lifecycle expired in the last 7 days?
SELECT bucket, key, version_id, last_modified_date, record_timestamp from s3_metadata_primary where requester = 's3.amazonaws.com' and record_type = 'DELETE' and record_timestamp > (current_date - interval '7' day);
What IP address(es) have recent requests come from?
SELECT source_ip_address, count(source_ip_address) from s3_metadata_primary group by source_ip_address;
What Key Management Service (AWS KMS) keys have been used for PUTs in the last 7 days?
SELECT DISTINCT kms_key_arn FROM s3_metadata_primary where record_timestamp > (current_date - interval '7' day);
What objects are not encrypted with KMS?
SELECT key, encryption_status FROM s3_metadata_primary WHERE encryption_status != 'SSE-KMS';
Create a distribution of object tags.
SELECT object_tags, count(object_tags)
FROM s3_metadata_primary group by object_tags;
Are there any objects without tags?
SELECT key, last_modified_date, object_tags FROM s3_metadata_primary
WHERE cardinality(object_tags) = 0 ORDER BY last_modified_date DESC;
Find objects in a given prefix in the S3 Glacier Deep Archive storage class and create a CSV file that could be used for a batch job to restore those objects.
SELECT bucket, key, version_id FROM s3_metadata_primary where key LIKE ‘legal/%’ and storage_class = ‘DEEP_ARCHIVE’;
3. Visualize your S3 Metadata in QuickSight
Amazon QuickSight is a powerful cloud-based business intelligence (BI) service that empowers organizations of all sizes to unlock the value of their data. You can use QuickSight to create dashboards and visualizations using data stored in S3 Metadata tables. The following are some examples of dashboards you can create:
- What percentage of your storage data is small objects compared to large object?
- What is the percentage of object uploads compared to deletions?
- What percentage of images files are in my bucket?
In this section, we walk through the following:
3.1. Configure permissions for the QuickSight service role. You need to give the QuickSight service role the Glue:GetCatalog permission.
3.2. Configure Lake Formation permissions to the QuickSight admin user. You will grant your QuickSight user with permissions to your database and the tables within your database.
3.3. Create a QuickSight dashboard. You will use custom SQL to load your S3 Metadata table into QuickSight. This will allow you visualize your S3 Metadata in customer Amazon QuickSight dashboards.
3.1. Configure permissions for the QuickSight service role
1. Navigate to Identity and Access Management (IAM) from the AWS Management Console.
2. Under Roles, identify the QuickSight service role, i.e., aws-quicksight-service-role-v0.
3. Under the Permissions tab, add an inline policy for the Glue GetCatalog permission to access the S3 table catalog from QuickSight. Use the following JSON to add the inline policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": "glue:GetCatalog",
"Resource": "*"
}
]
}
3.2. Configure permissions for the QuickSight admin user
Identify the QuickSight admin user and its Amazon Resource Name (ARN) to grant Lake Formation permissions to access the S3 table catalog. Run the following CLI command from AWS CloudShell (or any terminal you choose) and note down the QuickSight admin user’s ARN.
$aws quicksight list-users --aws-account-id <AWS Account ID> --namespace default --region <AWS Region>
Navigate to the Lake Formation service from the AWS Management Console. Select Data permissions > Grant.
In the Principals section, select SAML users and groups and add the QuickSight admin user’s ARN. Then, select the Named Data Catalog Resources and grant the database permission and table permissions separately, as shown in the following pictures.
First, we will add permissions at the database level.
Select Named Data Catalog resources and choose the S3 Tables catalog and database.
Select the database permissions.
We now need to do the same thing, but this time at the table level. From the Lake Formation Console, select Data Permissions > Grant again.
Select Named Data Catalog resources and choose the S3 Tables catalog, database, and table.
Select the table permissions.
Navigate to the AWS Lake Formation service from the AWS Management Console. Select Data Permissions and verify that you have the permissions on the database, table, and column resources.
3.3. Create a QuickSight dashboard
Use the following steps to create a dashboard:
1. Sign in to the QuickSight console. From the user profile on the top right of the QuickSight console, choose the same AWS Region where you have created the S3 table bucket. For example, if you created the S3 table bucket in the Ohio (us-east-2) Region, select the same Region in QuickSight.
2. In the QuickSight dashboard, click on New analysis. On the New analysis page, click on New dataset.
3. In the Create a Dataset modal, select Athena as the data source, enter a name for your dataset, and click Create data source.
4. On the Choose your table session, select Use custom SQL. Please note, you will not be able to see the table in the dropdown list.
5. Enter a custom SQL query based on the columns you would like to visualize, and then select Confirm Query. In this example, you have selected all the columns by providing the following query. Replace s3-metadata-table-bucket with your S3 table bucket and replace s3_metadata_primary with your S3 Metadata table.
SELECT * FROM "s3tablescatalog/s3-metadata-table-bucket"."aws_s3_metadata"."s3_metadata_primary"
6. Click Visualize to analyze data and creating dashboards. Please note, you won’t see the ‘s3tablecatalog’ and ‘aws_s3_metadata’ as databases. The data source will be the same name that you have given in step 3.
7. Select Create in the QuickSight to get started. To analyze and create dashboards, refer to the Visualizing data and Exploring dashboards sections in the QuickSight User Guide.
8. Select the columns you would like to analyze and create dashboards.
Examples of QuickSight dashboards:
Cleaning up
When your work is complete in Athena and QuickSight, make sure you disable the S3 Metadata configuration and delete the table to ensure you do not continue to incur costs. To delete the S3 Metadata configuration, go to the S3 Metadata table for the bucket you created the configuration for and click Delete.
Conclusion
In this post, we walked through a step-by-step process to configure S3 Metadata and enable integration with Athena and QuickSight. We provided examples of SQL queries that you can use to find data insights into your S3 data.
Amazon S3 gives you the ability to store virtually unlimited objects in a single bucket for use cases like archive, analytics and, data lakes. While this makes it a powerful tool for storing data, there are situations where you want to manage individual files intelligently for reporting and automation to achieve business goals. Typical business goals are to understand usage, verify security, increase efficiency, drive down costs, or migrate data to other environments selectively. By bringing in AWS Analytics services to help analyze your object metadata, you can manage your data and achieve your business objectives.
To learn more about S3 Metadata (Preview), read the AWS News Blog.
Thanks for taking the time to read this post. If you have any feedback or questions, please don’t hesitate to leave them in the comments section.