AWS Storage Blog

Analyzing Amazon S3 Metadata with Amazon Athena and Amazon QuickSight

Object storage provides virtually unlimited scalability, but managing billions, or even trillions, of objects can pose significant challenges. How do you know what data you have? How can you find the right datasets at the right time?

By implementing a robust metadata management strategy, you can answer these questions, gain better control over massive data lakes, improve data discoverability and accessibility, and extract more value from your data. AWS introduced Amazon S3 Metadata (Preview) at re:Invent 2024, which delivers queryable object metadata in near real-time to organize your data and accelerate data discovery. S3 Metadata helps you to curate, identify, and use your S3 data for business analytics, real-time inference applications, and more.

In this post, we walk through how to set up and use S3 Metadata, how to derive actionable insights using simple SQL queries in Amazon Athena, and then how to visualize your data in Amazon QuickSight. We show you some example queries and dashboards that you can create to gain insights into, and build visualizations of, your S3 Metadata. You can use these queries or create your own to understand what data you have and to find the right datasets when you need them.

Amazon S3 Metadata (Preview)

S3 Metadata delivers queryable object metadata in near real-time to organize your data and accelerate data discovery. S3 Metadata includes new fields that aren’t available today from other sources such as S3 Inventory reports or the ListObjectsV2 API. For a detailed comparison, refer to the following table.

Availability of each metadata field by source (Y = available, N = not available):

Metadata field | S3 Inventory | ListObjectsV2 | S3 Metadata
bucket | Y | Y | Y
key | Y | Y | Y
sequence_number | N | N | Y
record_type | N | N | Y
record_timestamp | N | N | Y
version_id | Y | Y | Y
is_delete_marker | Y | N | Y
size | Y | Y | Y
last_modified_date | Y | Y | Y
e_tag | Y | Y | Y
storage_class | Y | Y | Y
is_multipart | Y | N | Y
encryption_status | Y | Y | Y
is_bucket_key_enabled | Y | N | Y
kms_key_arn | N | N | Y
checksum_algorithm | Y | Y | Y
object_tags | N | N | Y
user_metadata | N | N | Y
requester | N | N | Y
source_ip_address | N | N | Y
request_id | N | N | Y
object_lock_retain_until_date | Y | N | N
object_lock_mode | Y | N | N
object_lock_legal_hold_status | Y | N | N
intelligent_tiering_access_tier | Y | N | N
bucket_key_status | Y | N | N
object_access_control_list | Y | N | N
object_owner | N | Y | N
replication_status | Y | N | N
restore_status | N | Y | N

The metadata table records mutation events: object creation, metadata updates such as tag changes, and object deletion. Each row is a change event and each column is a metadata field. S3 Metadata delivers each event at least once, so the table may contain duplicate records.
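Because the same event can appear more than once, you may want to drop duplicates before counting or aggregating rows. The following is a minimal Python sketch of one way to do that on rows you have already fetched. It assumes that the combination of bucket, key, version_id, sequence_number, and record_type identifies a single change event, and that a greater sequence_number means a more recent event for the same key. Both are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def drop_duplicate_events(rows: Iterable[Row]) -> List[Row]:
    """Keep the first copy of each change event, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: these columns together identify one change event.
        identity = (
            row["bucket"],
            row["key"],
            row.get("version_id"),
            row["sequence_number"],
            row["record_type"],
        )
        if identity not in seen:
            seen.add(identity)
            unique.append(row)
    return unique


def latest_event_per_key(rows: Iterable[Row]) -> Dict[Tuple[object, object], Row]:
    """Return the most recent event for each (bucket, key) pair."""
    latest: Dict[Tuple[object, object], Row] = {}
    for row in rows:
        obj = (row["bucket"], row["key"])
        # Assumption: a greater sequence_number means a more recent event.
        if obj not in latest or row["sequence_number"] > latest[obj]["sequence_number"]:
            latest[obj] = row
    return latest
```

Reducing the events to the latest one per key gives an approximate view of the current state of each object: a key whose latest event is a DELETE is no longer present.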

Prerequisites

There are a few prerequisites before you get started:

1. Have or create an IAM role to use with S3, Glue, Lake Formation, Athena, and QuickSight. For more information, see IAM roles.

2. Ensure that you have access to Athena from your account.

3. Have or create an S3 general purpose bucket that has traffic. Since S3 Metadata collects mutations, you will need an S3 bucket that has PUT and/or DELETE traffic.

Walkthrough

At a high level, we:

1. Enable S3 Metadata on your general purpose S3 bucket

2. Run SQL queries on Athena to gain insights on S3 Metadata

3. Configure QuickSight to visualize S3 Metadata

1. Enable S3 Metadata on your general purpose S3 bucket

In this section, we walk through the following:

1.1. Create an S3 table bucket to hold the S3 Metadata Iceberg table where the metadata will be stored and enable integration with Glue and Lake Formation.

1.2. Configure S3 Metadata to collect data on a general purpose S3 bucket.

1.1. Create a table bucket and enable integration with AWS analytics services

S3 Metadata uses S3 Tables to store the output table. S3 Tables are purpose-built for storing tabular data at scale. S3 Tables’ preview integration with AWS Glue Data Catalog allows you to query and visualize data using AWS Analytics services such as Athena and QuickSight. Let’s create a table bucket to store the metadata table.

To get started, navigate to S3 in the AWS Management Console. On the left navigation menu, select Table buckets, and then Create table bucket.

Selecting Table buckets and Create table bucket in AWS Console

Enter a name for the table bucket, ensuring the Enable Integration checkbox is selected for integration with AWS Analytics services. Select the Create table bucket button to finalize the table bucket creation.

Entering Table bucket name, enabling analytics services integration, and selecting create table bucket

Once the S3 table bucket is created, you will see this screen.

Successful creation of table bucket

1.2. Configure S3 Metadata

Once you’ve created the table bucket to store the metadata table, navigate to the general purpose bucket for which you want to configure S3 Metadata. This is the bucket whose object metadata you want S3 Metadata to surface automatically. Once you have navigated to the bucket in the S3 Console, click the Metadata tab. Then, click Create metadata configuration.

Metadata tab in S3 bucket console menu

Select Browse S3, select the S3 table bucket created in the previous step, and click Choose.

Choosing the table bucket created in earlier steps

Accept the default table name or enter another name for the metadata table as you prefer. Click Create metadata configuration.

Create metadata configuration menu showing destination table bucket and metadata table name

2. Run SQL queries on Amazon Athena to gain insights on S3 Metadata

In this section, we walk through the following:

2.1. Configure Lake Formation permissions. Here we grant the console role access to the S3 Metadata table.

2.2. Verify access to S3 Metadata in Athena. You will find the catalog and database in the S3 Metadata configuration page.

2.3. Explore your S3 Metadata with Athena. You will try some sample queries to better understand what kinds of insights are possible.

2.1. Configure Lake Formation permissions

Once you’ve set up the metadata table, you will need to grant permissions for users to access the table if you want them to query the table from AWS services like Athena and QuickSight. Permissions are granted using Lake Formation.

Go to Lake Formation in the AWS Management Console. If this is the first time you are using Lake Formation, you will see a Welcome window. Select Add myself and optionally add any other users or roles that should be data lake administrators. Then click Get started.

Adding AWS account as admin user in Welcome to Lake Formation pop-up box

Once loaded, select Catalogs from the left menu, and then s3tablescatalog. This is the federated catalog that contains table buckets. For more information on the integration, refer to the S3 Tables documentation on integration with AWS Analytics services.

Catalogs menu in AWS Lake Formation, showing S3 Tables catalog

Select the radio button next to the S3 Table Bucket created in the earlier step, and select Actions, then Grant.

Catalog menu showing Actions dropdown with Grant selected

In the Principals section, select the IAM users and roles you want to grant access to.

Note: Lake Formation also integrates with IAM Identity Center, SAML (Security Assertion Markup Language) users and groups, and external accounts. Configuration of these options is not covered in this post.

Choose Principals configuration with IAM users and roles, and the admin IAM role selected

In the LF-Tags or catalog resources section, we will grant access to the metadata table you just created. On this screen, Named Data Catalog resources should already be selected, as should the s3tablescatalog/<<your s3 table bucket>> as the catalog. If s3tablescatalog/<<your s3 table bucket>> is not selected, select it from the dropdown menu and deselect any other catalogs. From the Databases section, select aws_s3_metadata from the dropdown menu. From Tables, select the s3 metadata table you just created to grant access or select All Tables if you want to create more S3 Metadata tables in the same S3 table bucket.

Catalog resources configuration with the Named Data Catalog resources radio button selected, the S3 metadata table bucket selected from the Catalogs dropdown, aws_s3_metadata selected from the Databases dropdown, and the table created in the previous step selected from the Tables dropdown

In the Table permissions section, configure the permissions that the principal will have on the table. Here, check Select and Describe under both Table permissions and Grantable permissions. For Data permissions, keep the default of All data access, and click Grant to finish.

Select and Describe check boxes ticked for Table and Grantable permissions. All data access radio box selected for Data permissions.

2.2. Verify access to S3 Metadata in Athena

Once the Lake Formation permissions have been added, navigate to the Athena Console. If this is the first time you are using Athena, use this guide to get started with Athena. You can follow the guide up to and including step 7.

On the left side of the editor, select the Catalog and Database for your S3 Metadata configuration.

Athena Console with S3 table bucket as Catalog and S3 Metadata namespace as database selected on left side. Editor shows SQL statement to select all objects in the table with a limit of 10 rows

You can find the values for Catalog and Database on the S3 Metadata configuration page for your bucket.

S3 Metadata configuration page

Once you have set up Athena, you can run the following query to verify that everything is working correctly.

SELECT * FROM <<table name>> LIMIT 10;
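If you prefer to run the same check from a script rather than the console, the following Python sketch shows one way to do it with the AWS SDK for Python (boto3). It is a sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, the catalog, database, and output location are placeholders that you replace with your own values, and `run_query` needs boto3 and AWS credentials at call time.

```python
import time
from typing import Any, Dict


def build_query_request(sql: str, catalog: str, database: str,
                        output_location: str) -> Dict[str, Any]:
    """Assemble the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: Dict[str, Any], poll_seconds: float = 1.0) -> str:
    """Start the query, poll until it finishes, and return its final state."""
    import boto3  # imported here so build_query_request works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)


# Example call (placeholders; use the Catalog and Database values shown on
# the S3 Metadata configuration page for your bucket):
#
# request = build_query_request(
#     sql="SELECT * FROM s3_metadata_primary LIMIT 10",
#     catalog="s3tablescatalog/<<your s3 table bucket>>",
#     database="aws_s3_metadata",
#     output_location="s3://<<your athena results bucket>>/",
# )
# print(run_query(request))
```

A final state of SUCCEEDED indicates that the Lake Formation permissions and the catalog and database names are set up correctly.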

2.3. Explore your S3 Metadata with Athena

Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. Here is a series of sample queries you can use to analyze your S3 Metadata from Athena. In the examples below, we have used s3_metadata_primary as the table name. Replace that with your own S3 Metadata table name.

Find objects that match a pattern.

SELECT key FROM s3_metadata_primary where key LIKE '%.jpg'

How up-to-date is the table at this time (timestamp in GMT)?

SELECT MAX(last_modified_date) FROM s3_metadata_primary;

What objects were deleted in the last 24 hours?

SELECT bucket, key, version_id, last_modified_date from s3_metadata_primary where last_modified_date >= (current_date - interval '1' day) and record_type = 'DELETE';

Which objects has S3 Lifecycle expired in the last 7 days?

SELECT bucket, key, version_id, last_modified_date, record_timestamp from s3_metadata_primary where requester = 's3.amazonaws.com' and record_type = 'DELETE' and record_timestamp > (current_date - interval '7' day);

What IP address(es) have recent requests come from?

SELECT source_ip_address, count(source_ip_address) from s3_metadata_primary group by source_ip_address;

What Key Management Service (AWS KMS) keys have been used for PUTs in the last 7 days?

SELECT DISTINCT kms_key_arn FROM s3_metadata_primary where record_timestamp > (current_date - interval '7' day);

What objects are not encrypted with KMS?

SELECT key, encryption_status FROM s3_metadata_primary WHERE encryption_status != 'SSE-KMS';

Create a distribution of object tags.

SELECT object_tags, count(object_tags)
FROM s3_metadata_primary group by object_tags;

Are there any objects without tags?

SELECT key, last_modified_date, object_tags FROM s3_metadata_primary 
WHERE cardinality(object_tags) = 0  ORDER BY last_modified_date DESC;

Find objects in a given prefix in the S3 Glacier Deep Archive storage class and create a CSV file that could be used for a batch job to restore those objects.

SELECT bucket, key, version_id FROM s3_metadata_primary where key LIKE 'legal/%' and storage_class = 'DEEP_ARCHIVE';
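To turn the results of the last query into a CSV file for a batch job, you can post-process the rows with a few lines of Python. The following sketch assumes that the job expects a headerless CSV with the bucket, the URL-encoded object key, and optionally the version ID on each line. Check the manifest format that your batch job requires before using it. The function name and row layout are hypothetical.

```python
import csv
import io
from typing import Iterable, Optional, Tuple
from urllib.parse import quote

# One query result row: (bucket, key, version_id)
ResultRow = Tuple[str, str, Optional[str]]


def build_manifest(rows: Iterable[ResultRow], include_version_id: bool = False) -> str:
    """Render query results as a headerless CSV manifest."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for bucket, key, version_id in rows:
        # Assumption: the batch job expects URL-encoded keys ("/" is kept as-is).
        record = [bucket, quote(key)]
        if include_version_id:
            record.append(version_id or "")
        writer.writerow(record)
    return buffer.getvalue()
```

Because each row in the metadata table is a change event, the query can return the same key more than once. Consider removing repeated keys before you write the manifest.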

3. Visualize your S3 Metadata in QuickSight

Amazon QuickSight is a powerful cloud-based business intelligence (BI) service that empowers organizations of all sizes to unlock the value of their data. You can use QuickSight to create dashboards and visualizations using data stored in S3 Metadata tables. The following are some examples of dashboards you can create:

  • What percentage of your stored data is small objects compared to large objects?
  • What is the percentage of object uploads compared to deletions?
  • What percentage of the objects in your bucket are image files?

In this section, we walk through the following:

3.1. Configure permissions for the QuickSight service role. You need to give the QuickSight service role the glue:GetCatalog permission.

3.2. Configure Lake Formation permissions for the QuickSight admin user. You will grant your QuickSight user permissions on your database and the tables within it.

3.3. Create a QuickSight dashboard. You will use custom SQL to load your S3 Metadata table into QuickSight. This will allow you to visualize your S3 Metadata in custom Amazon QuickSight dashboards.

3.1. Configure permissions for the QuickSight service role

1. Navigate to Identity and Access Management (IAM) from the AWS Management Console.

2. Under Roles, identify the QuickSight service role, for example, aws-quicksight-service-role-v0.

3. Under the Permissions tab, add an inline policy for the Glue GetCatalog permission to access the S3 table catalog from QuickSight. Use the following JSON to add the inline policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "glue:GetCatalog",
            "Resource": "*"
        }
    ]
}
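If you manage IAM from scripts, the same inline policy can be attached with the AWS SDK for Python (boto3). The following is a minimal sketch: the function names and the policy name in the example call are hypothetical, the role name should match the QuickSight service role in your account, and `attach_policy` needs boto3 and AWS credentials at call time.

```python
import json


def glue_get_catalog_policy() -> str:
    """Return the inline policy shown above as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "glue:GetCatalog",
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy)


def attach_policy(role_name: str, policy_name: str) -> None:
    """Attach the policy to the given role as an inline policy."""
    import boto3  # imported here so the policy helper works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=glue_get_catalog_policy(),
    )


# Example call (the policy name is a placeholder chosen for illustration):
# attach_policy("aws-quicksight-service-role-v0", "QuickSightGlueGetCatalog")
```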

3.2. Configure permissions for the QuickSight admin user

Identify the QuickSight admin user and its Amazon Resource Name (ARN) to grant Lake Formation permissions to access the S3 table catalog. Run the following CLI command from AWS CloudShell (or any terminal you choose) and note down the QuickSight admin user’s ARN.

aws quicksight list-users --aws-account-id <AWS Account ID> --namespace default --region <AWS Region>

AWS CloudShell CLI output to capture the Amazon QuickSight user ARN
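If the account has many QuickSight users, you can pick out the admin ARNs programmatically instead of reading the output by eye. The following Python sketch assumes that you saved the command's JSON output to a string, and that the output has a UserList array whose entries carry Arn and Role fields. The function name is hypothetical.

```python
import json
from typing import List


def admin_user_arns(list_users_output: str) -> List[str]:
    """Return the ARNs of users whose Role is ADMIN in list-users JSON output."""
    response = json.loads(list_users_output)
    return [
        user["Arn"]
        for user in response.get("UserList", [])
        if user.get("Role") == "ADMIN"
    ]
```

For example, you could redirect the CLI output to a file with `--output json > users.json`, read the file in Python, and pass its contents to `admin_user_arns`.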

Navigate to the Lake Formation service from the AWS Management Console. Select Data permissions > Grant.

AWS Lake Formation data permissions

In the Principals section, select SAML users and groups and add the QuickSight admin user’s ARN. Then, select the Named Data Catalog Resources and grant the database permission and table permissions separately, as shown in the following pictures.

First, we will add permissions at the database level.

Grant database permissions to SAML users and groups

Select Named Data Catalog resources and choose the S3 Tables catalog and database.

Select the Named Data Catalog resources and choose the S3 Tables catalog and database.

Select the database permissions.

Database permissions

We now need to do the same thing, but this time at the table level. From the Lake Formation Console, select Data Permissions > Grant again.

Grant table permissions to SAML users and groups

Select Named Data Catalog resources and choose the S3 Tables catalog, database, and table.

Select Named Data Catalog resources and choose the S3 Tables catalogs, databases, and tables

Select the table permissions.

Select and Describe check boxes ticked for Table and Grantable permissions. All data access radio box selected for Data permissions

Navigate to the AWS Lake Formation service from the AWS Management Console. Select Data Permissions and verify that you have the permissions on the database, table, and column resources.

Verifying AWS Lake Formation permissions to the Amazon QuickSight user

3.3. Create a QuickSight dashboard

Use the following steps to create a dashboard:

1. Sign in to the QuickSight console. From the user profile on the top right of the QuickSight console, choose the same AWS Region where you have created the S3 table bucket. For example, if you created the S3 table bucket in the Ohio (us-east-2) Region, select the same Region in QuickSight.

Selecting the Region in Amazon QuickSight

2. In the QuickSight dashboard, click on New analysis. On the New analysis page, click on New dataset.
3. In the Create a Dataset modal, select Athena as the data source, enter a name for your dataset, and click Create data source.

Create Athena data source

4. On the Choose your table screen, select Use custom SQL. Note that you will not be able to see the table in the dropdown list.

Choose your table

5. Enter a custom SQL query based on the columns you would like to visualize, and then select Confirm Query. In this example, the following query selects all the columns. Replace s3-metadata-table-bucket with your S3 table bucket and replace s3_metadata_primary with your S3 Metadata table.

SELECT * FROM "s3tablescatalog/s3-metadata-table-bucket"."aws_s3_metadata"."s3_metadata_primary"

Inputting a SQL query to analyze in QuickSight

6. Click Visualize to analyze data and create dashboards. Note that ‘s3tablescatalog’ and ‘aws_s3_metadata’ do not appear as databases. The data source has the name that you entered in step 3.

Selecting 'Visualize' to start creating the QuickSight dashboard

7. Select Create in QuickSight to get started. To analyze and create dashboards, refer to the Visualizing data and Exploring dashboards sections in the QuickSight User Guide.

Selecting 'Create' to start work on the first Amazon QuickSight chart

8. Select the columns you would like to analyze and create dashboards.

Selecting columns to create charts for analysis
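The custom SQL from step 5 uses a three-part, double-quoted table name, which is easy to mistype. If you generate the query from a script, a small helper can assemble it for you. The following Python sketch uses hypothetical function names, and assumes the aws_s3_metadata database name used in this post. Selecting only the columns you plan to chart, instead of all columns, keeps the dataset smaller.

```python
from typing import Sequence


def qualified_table_name(table_bucket: str, table: str,
                         database: str = "aws_s3_metadata") -> str:
    """Build the three-part, double-quoted name used in the custom SQL."""
    parts = [f"s3tablescatalog/{table_bucket}", database, table]
    # Double any embedded double quote, the standard SQL identifier escape.
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


def select_columns(table_bucket: str, table: str,
                   columns: Sequence[str] = ("*",)) -> str:
    """Return a SELECT statement over the chosen columns of the metadata table."""
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM {qualified_table_name(table_bucket, table)}"
```

For example, `select_columns("s3-metadata-table-bucket", "s3_metadata_primary", ["key", "size", "storage_class"])` produces a query that loads only those three columns.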

Examples of QuickSight dashboards:

Examples of Amazon QuickSight charts created using S3 Metadata

Cleaning up

When your work is complete in Athena and QuickSight, make sure you disable the S3 Metadata configuration and delete the table so that you do not continue to incur costs. To delete the S3 Metadata configuration, go to the Metadata tab of the bucket for which you created the configuration and click Delete.

Delete the S3 Metadata configuration

Conclusion

In this post, we walked through a step-by-step process to configure S3 Metadata and enable integration with Athena and QuickSight. We provided examples of SQL queries that you can use to gain insights into your S3 data.

Amazon S3 gives you the ability to store virtually unlimited objects in a single bucket for use cases like archive, analytics, and data lakes. While this makes it a powerful tool for storing data, there are situations where you want to manage individual files intelligently for reporting and automation to achieve business goals. Typical business goals are to understand usage, verify security, increase efficiency, drive down costs, or migrate data to other environments selectively. By bringing in AWS Analytics services to help analyze your object metadata, you can manage your data and achieve your business objectives.

To learn more about S3 Metadata (Preview), read the AWS News Blog.

Thanks for taking the time to read this post. If you have any feedback or questions, please don’t hesitate to leave them in the comments section.

Lokesh AP

Lokesh AP is a Sr. Solutions Architect specializing in AWS Storage services. Prior to joining AWS, he worked as a technologist for companies such as NetApp and Western Digital. Based in California, he enjoys playing cricket.

Tom Bailey

Tom Bailey is a Senior Technical Account Manager at AWS, where he supports Financial Services customers. He has over 15 years of experience in IT consultancy, presales, and architecture roles. Outside of work, he enjoys spending time with his family, walking his Labrador, running, and cycling.

Huey Han

Huey Han is a Senior Product Manager for Amazon S3. He focuses on data lake, analytics, and data governance at S3. He is based in New York City. In his spare time, Huey enjoys martial arts.

Lee Kear

Lee Kear has been working in IT since she received her Master’s Degree in Computer Science from the Georgia Institute of Technology in 1999. She started working at AWS in 2012 as a systems engineer on the Amazon S3 team. She became the first Storage Specialist Solutions Architect specializing in S3 in 2016 and she still enjoys this role today. She loves to help customers use S3 in the most efficient, performant, and cost effective way possible for their use case. Outside of work, she enjoys traveling with her wife.

Fabio Lattanzi

Fabio is a Sr. Solution Architect focused on AWS Transfer Family and Amazon S3. He enjoys helping customers build the most durable, scalable, performant, and cost-effective storage solutions for their use case. He is based in Utah, loves traveling with his wife, cuddling with his two dogs, and playing drums.