AWS Storage Blog

Transition data to cheaper storage based on custom filtering criteria with Amazon S3 Lifecycle

As your organization’s data grows, effective management of storage costs is crucial for operating an efficient and cost-effective data infrastructure. One of the most efficient strategies to reduce storage costs is transitioning files to less expensive cold storage classes. To optimize storage costs according to their specific needs and requirements, organizations need the flexibility to set up customized transition rules with granular filters.

Amazon S3 Lifecycle lets users transition objects across different storage classes to optimize their Amazon S3 storage costs. An S3 Lifecycle configuration automates object lifecycle management in S3 buckets, allowing you to transition objects to lower-cost storage classes or expire them based on their age or other criteria. As of this writing, S3 Lifecycle rules support filtering objects by prefix, object tag, object size, or a combination of these criteria. However, users often want to customize S3 Lifecycle configuration rules to transition objects with specific file extensions (for example, CSV, image, or PDF files), and as of this writing file extension is not an available filter in an S3 Lifecycle configuration.

In this post, I show you how to customize S3 Lifecycle configuration rules by using Amazon S3 Batch Operations to tag objects of a specific file type (for example, CSV files). Then, you can use this object tag within the S3 Lifecycle configuration rules to transition objects across more cost-effective S3 storage classes. While this post focuses on transitioning objects with a certain file extension, you can modify or extend this solution to transition objects based on file creation date, name, and storage class. This enables efficient data lifecycle management, storage capacity optimization, compliance with retention policies, and streamlined automation, thus reducing manual effort and costs.

Solution overview

In this solution, you use the S3 Batch Operations Replace all object tags operation to tag the CSV objects within a bucket. Then, you can use the object tag within S3 Lifecycle rules to transition these objects to a cheaper storage class and reduce storage costs. S3 Batch Operations is recommended when you want to perform batch operations on up to billions of objects. With S3 Batch Operations, S3 tracks progress and stores a detailed completion report of all actions, providing a fully managed, auditable, and serverless experience.

Figure 1: Solution overview of transitioning objects across storage classes using S3 Batch Operations and S3 Lifecycle

The solution workflow is as follows:

1. The S3 Batch Operations job scans the objects listed in a manifest file. The manifest file is generated automatically as part of the S3 Batch Operations job creation command.

2. The manifest file contains the S3 objects with the CSV file extension. You can filter for other file types according to your use case using the MatchAnySuffix parameter within the manifest-generator section of the S3 Batch Operations job creation command.

3. Upon completion, the S3 Batch Operations job adds an object tag, PatternFound, with a value of Yes to the CSV files. You then create S3 Lifecycle configuration rules that filter objects based on the PatternFound object tag, so that the rules apply only to objects whose names match the pattern you provided.

4. When the S3 Lifecycle rules run (daily at midnight UTC), only objects with the object tag PatternFound marked as Yes are transitioned by this rule.

Prerequisites

The following prerequisites are necessary to continue with this post:

1. If you don’t already have an AWS account, then your first step is to create and activate one.

2. An S3 bucket with objects where you want to configure the S3 Lifecycle configuration rule. Refer to the Amazon S3 User Guide to create an S3 bucket.

3. A basic understanding of S3 Batch Operations. Refer to the S3 Batch Operations post for more information.

4. An AWS Identity and Access Management (IAM) role that is used by the S3 Batch Operations job. In the following solution, I use a custom IAM role “S3BatchJobRole” that has the necessary privileges to run S3 Batch Operations and access my S3 bucket.

Figure 2: Create IAM role S3BatchJobRole with trust enabled for S3 Batch Operations service

5. I have attached a custom IAM policy PutObjectTaggingBatchJobPolicy to the S3BatchJobRole role. This policy has the necessary privileges to read and tag objects and to write the generated manifest files and job completion reports in my S3 bucket. I have provided my PutObjectTaggingBatchJobPolicy JSON for reference. You can use the same one; just replace the <replace-with-your-bucket> placeholder with your bucket name. If you prefer the AWS CLI over the console, a sketch for creating the role and attaching this policy follows the policy JSON below.

Figure 3: Create IAM policy PutObjectTaggingBatchJobPolicy with necessary privileges and attach to S3BatchJobRole

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObjectTagging",
                "s3:PutObjectVersionTagging"
            ],
            "Resource": "arn:aws:s3:::<replace-with-your-bucket>/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutInventoryConfiguration"
            ],
            "Resource": "arn:aws:s3:::<replace-with-your-bucket>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:ListBucket",
	         "s3:PutObject",
	         "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<replace-with-your-bucket>",
                "arn:aws:s3:::<replace-with-your-bucket>/*"
            ]
        }
    ]
}
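
If you prefer to set up the role from the AWS CLI instead of the console, the following is a minimal sketch. It assumes you save the trust policy and the permissions policy shown above as local files named trust-policy.json and PutObjectTaggingBatchJobPolicy.json; adjust the names and paths as needed.

cat > trust-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "batchoperations.s3.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF

# Create the role with the S3 Batch Operations service trusted to assume it
aws iam create-role \
    --role-name S3BatchJobRole \
    --assume-role-policy-document file://trust-policy.json

# Attach the permissions policy shown above as an inline policy
aws iam put-role-policy \
    --role-name S3BatchJobRole \
    --policy-name PutObjectTaggingBatchJobPolicy \
    --policy-document file://PutObjectTaggingBatchJobPolicy.json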

Solution walkthrough

The solution walkthrough involves:

1. Creating an S3 Batch Operations job

2. Finding the number of objects within a bucket for a specific file extension

3. Running the S3 Batch Operations job

1. Creating an S3 Batch Operations job

You can create an S3 Batch Operations job through the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDKs, or REST API. As of this writing, you can’t automate the manifest file creation from the AWS Management Console. You use the S3 Batch Operations CLI command create-job to create an S3 Batch Operations job and automate the manifest file creation. I am running the following command directly from AWS CloudShell. However, you can run it from your local machine as long as you have installed the latest version of the AWS CLI and have set up the environment variables to access your AWS account. Replace the account, S3 bucket, and IAM role placeholders as applicable.

aws s3control create-job \
    --account-id <replace-with-your-account> \
    --operation '{
       "S3PutObjectTagging": {"TagSet": [{"Key":"PatternFound", "Value":"Yes"}]}
    }' \
    --report '{
        "Bucket":"arn:aws:s3:::<replace-with-your-bucket>",
        "Prefix":"s3batchreport",
        "Format":"Report_CSV_20180820",
        "Enabled":true,
        "ReportScope":"AllTasks"
    }' \
    --manifest-generator '{
        "S3JobManifestGenerator": {
          "ExpectedBucketOwner": "<replace-with-your-account>",
          "SourceBucket": "arn:aws:s3:::<replace-with-your-bucket>",
          "EnableManifestOutput": true,
          "ManifestOutputLocation": {
            "ExpectedManifestBucketOwner": "<replace-with-your-account>",
            "Bucket": "arn:aws:s3:::<replace-with-your-bucket>",
            "ManifestPrefix": "s3batchreportmanifest",
            "ManifestFormat": "S3InventoryReport_CSV_20211130"
          },
          "Filter": {
            "KeyNameConstraint": {
              "MatchAnySuffix": [
                ".csv"
              ]
            }
          }
        }
      }' \
    --priority 98 \
    --role-arn <replace-with-your-S3BatchJobRole-arn> \
    --region us-east-1

Figure 4: Create S3 Batch Operations job

Upon successful execution, you should observe an S3 Batch Operations job ID. Note this job ID, because you run this job from the S3 console to tag the CSV objects within the S3 bucket.
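
You can also check the job from the CLI. The following is a minimal sketch using describe-job; the account, job ID, and Region are placeholders, and the --query expression is one way to pull out the job state and the number of objects in the generated manifest.

# Returns the job state and the number of objects in the generated manifest
aws s3control describe-job \
    --account-id <replace-with-your-account> \
    --job-id <replace-with-your-job-id> \
    --region us-east-1 \
    --query 'Job.{Status: Status, TotalObjects: ProgressSummary.TotalNumberOfTasks}'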

2. Finding the number of objects within a bucket for a specific file extension

There are per-request ingest charges when using PUT, COPY, or S3 Lifecycle rules to move data into any S3 storage class. Consider the ingest or transition costs before moving objects into any storage class, and estimate your costs using the AWS Pricing Calculator.

S3 Lifecycle transition costs depend on the number of objects to be transitioned and the total S3 Lifecycle transition requests. You can identify the total number of objects that get tagged and transitioned by navigating to Amazon S3 and choosing Batch Operations. If the job creation was successful, then the job status is reflected as Awaiting your confirmation to run, and the Total objects field gives you the count of objects that get tagged when you run the job.

Figure 5: Get total objects that get tagged by S3 Batch Operations job
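
If you want to cross-check this count independently of the job, the following is a minimal sketch using list-objects-v2 with a JMESPath filter. The AWS CLI pages through the whole bucket, so for very large buckets an S3 Inventory report is a better fit.

# Count the objects in the bucket whose keys end with .csv
aws s3api list-objects-v2 \
    --bucket <replace-with-your-bucket> \
    --query "length(Contents[?ends_with(Key, '.csv')])"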

3. Running the S3 Batch Operations job

Once you have reviewed the manifest file and validated the objects that the S3 Batch Operations job will tag, run the job to tag these objects.

1. Select the job that you want to run and choose Run job. Validate the job details on the next screen and choose Run job. The status of the job changes to Ready, and upon successful completion the status changes to Completed.
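
Alternatively, you can confirm and start a job from the CLI. A minimal sketch using update-job-status, with the same placeholders as before, follows.

# Moves the job out of "Awaiting your confirmation to run" and starts it
aws s3control update-job-status \
    --account-id <replace-with-your-account> \
    --job-id <replace-with-your-job-id> \
    --requested-job-status Ready \
    --region us-east-1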

2. S3 Batch Operations generates a job report under the s3batchreport folder of the S3 bucket that you are using for this solution. This report provides the status of every object that was tagged as part of the S3 Batch Operations job.

Figure 6: S3 Batch Operations job completion report

3. In your S3 bucket, spot-check a few CSV files to make sure a new tag PatternFound is created and marked as Yes.

Figure 7: Examining the S3 object to make sure the tag was created successfully for CSV files
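
You can also perform this spot check from the CLI. A minimal sketch using get-object-tagging, with a placeholder object key, follows.

# Returns the tag set for a single object; expect PatternFound=Yes for the tagged CSV files
aws s3api get-object-tagging \
    --bucket <replace-with-your-bucket> \
    --key <replace-with-a-csv-object-key>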

4. Now that the objects of your interest are tagged, you can use this tag within the S3 Lifecycle configuration rules to transition the objects across S3 storage classes. Refer to the Amazon S3 User Guide for how to create, enable, edit, and delete S3 Lifecycle rules.

5. In Figure 8, I have created an S3 Lifecycle rule to transition objects with the tag PatternFound: Yes to S3 Intelligent-Tiering on day 1 and to S3 Glacier Instant Retrieval on day 2.

Figure 8: Example S3 Lifecycle configuration rule to transition objects
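
If you prefer to define the same rule from the CLI, the following is a minimal sketch. The rule ID and file name are placeholders, and put-bucket-lifecycle-configuration replaces the bucket’s entire existing lifecycle configuration, so merge this rule into your existing rules if you already have any.

cat > lifecycle.json << 'EOF'
{
    "Rules": [
        {
            "ID": "TransitionTaggedCsvObjects",
            "Status": "Enabled",
            "Filter": {
                "Tag": { "Key": "PatternFound", "Value": "Yes" }
            },
            "Transitions": [
                { "Days": 1, "StorageClass": "INTELLIGENT_TIERING" },
                { "Days": 2, "StorageClass": "GLACIER_IR" }
            ]
        }
    ]
}
EOF

# Applies the lifecycle configuration to the bucket (replacing any existing configuration)
aws s3api put-bucket-lifecycle-configuration \
    --bucket <replace-with-your-bucket> \
    --lifecycle-configuration file://lifecycle.json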

6. S3 Lifecycle rules are scheduled to run once per day. After the S3 Lifecycle rules run, you can validate that your objects were transitioned across storage classes. The following figure shows that the CSV files are transitioned to the S3 Glacier Instant Retrieval storage class after two days.

Figure 9: Examining the S3 objects to make sure S3 Lifecycle rules are working as expected
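
A minimal sketch to verify the storage class of an individual object from the CLI, assuming a placeholder object key:

# After the transition this returns "GLACIER_IR"; the field is omitted for objects still in S3 Standard
aws s3api head-object \
    --bucket <replace-with-your-bucket> \
    --key <replace-with-a-csv-object-key> \
    --query StorageClass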

Considerations

Refer to the Amazon S3 User Guide to troubleshoot S3 Lifecycle issues. Along with S3 Lifecycle transition costs, the total cost to implement this solution should consider S3 object tagging and S3 Batch Operations costs.

Cleaning up

Navigate to your Amazon S3 bucket to delete the S3 Lifecycle rule that you created. Under the Management tab, select the lifecycle rule and choose Delete.

Figure 10: Deleting Amazon S3 Lifecycle rule
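
If the rule from this post is the only lifecycle rule on the bucket, you can also remove it from the CLI with delete-bucket-lifecycle, which deletes the bucket’s entire lifecycle configuration; use the console instead if you have other rules you want to retain.

# Removes the entire lifecycle configuration from the bucket
aws s3api delete-bucket-lifecycle \
    --bucket <replace-with-your-bucket>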

Conclusion

In this post, I covered optimizing S3 storage costs with customized S3 Lifecycle rules. I extended the out-of-the-box S3 Lifecycle filtering capabilities by using S3 Batch Operations to create S3 object tags based on a specific file extension and then used the tags in an S3 Lifecycle rule to transition the tagged objects to a cheaper storage class.

You can further modify or extend this solution by using different filter criteria, such as file creation date, name, storage class (refer to the “Object filter criteria” section in the S3 User Guide), or contents (refer to “Invoke AWS Lambda function” within the S3 User Guide) for object tagging. This solution can help you tailor your storage lifecycle automation to your specific needs, enabling more effective and cost-efficient storage management.

Thank you for reading this post. If you have any comments, feel free to leave them in the comments section. To learn more about S3 Lifecycle or S3 Batch Operations, visit the S3 User Guide.

Bhavin Lakhani

Bhavin Lakhani is a Senior Technical Account Manager at Amazon Web Services (AWS), specializing in Cloud Financials, Analytics, Databases, Application, and Infrastructure Management. Bhavin holds a Bachelor of Engineering degree from the University of Mumbai and an Enterprise Architecture degree from Penn State University. In his previous roles, Bhavin has managed Infrastructure and Application engineering teams across various Cloud, Software-as-a-Service (SaaS), Product Development, Database, and application upgrade projects. He is passionate about Cloud Economics and Enterprise Architecture and uses his expertise in these areas to help AWS users optimize their spending and architecture.