AWS Storage Blog
Analyze access patterns and use the most cost-effective Amazon S3 storage class
Businesses that use data lakes, machine learning (ML), and analytics need scalable data storage. However, not all stored data is accessed equally. Some portions of data may be accessed often, whereas other portions are rarely accessed. Modern cloud storage allows users to move infrequently used, cold data to lower-cost storage classes, saving money on storage for data that is not frequently accessed. As the amount of data grows, it can be challenging to track access patterns and move data to the appropriate storage classes to get the full cost benefit. Businesses need to carefully monitor their data usage and storage costs to make sure they are maximizing the savings from tiered storage.
Amazon S3 is an object storage service that offers various storage classes to meet different performance, access, resilience, and cost needs. S3 Lifecycle and the Amazon S3 Intelligent-Tiering storage class can help you automatically transition data between different storage classes or tiers to optimize costs. However, realizing the full savings means that you must understand your actual access patterns first. Storing data in the S3 Standard storage class incurs the same storage cost for frequently and infrequently accessed data, which limits potential savings. Using S3 Lifecycle policies or S3 Intelligent-Tiering without considering the predictability or unpredictability of your access patterns may lead to higher retrieval fees, early delete charges if objects are removed before the minimum storage duration, or missed savings opportunities due to delayed transitions to cheaper storage.
In this post, I analyze access patterns using S3 Storage Class Analysis to help you decide when to transition data to another storage class. After S3 Storage Class Analysis observes the access patterns of a filtered set of data over a period of time, the results can help you make the right choice between S3 Lifecycle policies, for predictable data access patterns, or the S3 Intelligent-Tiering storage class, for unpredictable access patterns. With this solution you can analyze your storage access patterns and optimize storage costs by making the most out of different storage classes and tiers.
Solution walkthrough
We divide our approach into three phases:
- Enable: Enable S3 Storage Class Analysis reports on the buckets identified. By the end of this phase, the S3 Storage Class Analysis report is delivered to the Amazon S3 path mentioned during configuration.
- Analyze: Analyze the S3 Storage Class Analysis report that gets updated every day. By the end of this phase, the bucket workload pattern is identified as either predictable or unpredictable.
- Act: Take action based on the outcome of the analysis phase, with clear indications of access patterns.
Note: S3 Storage Class Analysis is a paid feature, as shown on the Amazon S3 pricing page, and is charged per object monitored.
1. Enable
Before moving to this phase, identify and target large buckets for cost optimization using S3 Storage Lens. The focus of this post is to enable S3 Storage Class Analysis reports on the buckets identified with S3 Storage Lens.
S3 Storage Lens provides an organization-wide, high-level view of S3 usage and activity metrics across all your S3 buckets. It helps identify optimization opportunities and implement best practices. Storage Class Analysis enables you to monitor storage access patterns, object age distribution, and transition opportunities for specific large-scale S3 buckets, typically at the petabyte level. It aids in precise lifecycle management and cost optimization for individual, high-impact buckets.
S3 Storage Class Analysis reports are available in two forms: directly as an analytics feature in the metrics tab of an S3 bucket or in a comma-separated value (CSV) format that can be exported to another S3 bucket.
This post focuses on the second form, analyzing recommendations in the CSV report, because the CSV allows you to drill down to a specific date for detailed analysis, whereas the console metrics provide a visual representation over a period of time.
Enabling S3 Storage Class Analysis report
In the Amazon S3 console, choose Buckets, select the Bucket to analyze, and choose the Metrics tab, as shown in Figure 1.
Figure 1: Navigating to S3 Storage Class Analysis report
Go to S3 Storage Class Analysis and choose Create configuration.
Figure 2: Creating configuration of the Storage Class Analysis report
After you select Create configuration, enter the configuration details:
- Under Name, enter a configuration name for the S3 Storage Class Analysis report.
- Under Configuration scope, set the scope of the report by choosing either Limit the scope of this rule using one or more filters to analyze a given prefix, or Apply to all objects in the bucket to analyze the entire bucket.
- Under Export CSV, check Enable to export the report in CSV format.
- Choose the Destination bucket account by selecting This account or A different account.
- Under Destination, select the Destination bucket to which the report should be delivered, as shown in Figure 3.
Figure 3: Enabling S3 Storage Class Analysis report
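If you prefer to script this setup instead of using the console, the same configuration can be created with the AWS SDK. The following is a minimal boto3 sketch; the bucket names, prefix, and configuration ID are placeholders for illustration and are not from the example in this post.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names for illustration; replace with your own buckets and prefix.
source_bucket = "my-analytics-source-bucket"
report_bucket_arn = "arn:aws:s3:::my-analytics-report-bucket"

s3.put_bucket_analytics_configuration(
    Bucket=source_bucket,
    Id="cold-data-analysis",  # configuration name shown in the console
    AnalyticsConfiguration={
        "Id": "cold-data-analysis",
        # Limit the scope to a prefix; omit Filter to apply to all objects.
        "Filter": {"Prefix": "logs/"},
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": report_bucket_arn,
                        "Prefix": "storage-class-analysis/",
                    }
                },
            }
        },
    },
)
```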
After you’ve configured the S3 Storage Class Analysis report, you should start observing data analysis based on the filter in the S3 console in 24–48 hours. However, S3 Storage Class Analysis observes the access patterns of a filtered dataset for 30 days or longer to gather information for analysis before giving a result. The analysis continues to run after the initial recommendation and updates as the access patterns change.
The recommendations from the S3 Storage Class Analysis report only show moving to the Amazon S3 Standard-Infrequent Access (S3 Standard-IA) storage class. However, you can use the observed patterns to determine whether your access patterns are predictable or unpredictable. Examples of both are discussed in the next phase.
By the end of this phase, the S3 Storage Class Analysis report should be delivered to the Amazon S3 bucket mentioned during configuration.
2. Analyze
This phase focuses on analyzing the S3 Storage Class Analysis report, which is updated every day. After enabling, the report contains multiple entries for each date.
The following is an example sample report:
Figure 4: Sample S3 Storage Class Analysis report with multiple entries
The S3 User Guide has a description of each column, including ObjectAge, Storage_MB, DataRetrieved_MB, and GetRequestCount, which are used for this analysis.
A good starting point for analysis is to filter the latest date on the report, as shown in Figure 5.
Figure 5: Sample S3 Storage Class Analysis report filtered to a single date
To derive patterns, focus on the following columns: ObjectAge, Storage_MB, DataRetrieved_MB, and GetRequestCount.
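To make this filtering repeatable, you can load the exported CSV and group it by age bucket. The following is a minimal sketch using pandas; the file path is a placeholder, and the column names follow the sample report shown in Figure 4, so verify them against your own export.

```python
import pandas as pd

# Placeholder path to a downloaded copy of the exported report.
# Column names follow the sample report in Figure 4; verify against your export.
report = pd.read_csv("storage-class-analysis.csv")

# Keep only the most recent date in the report (assumes ISO-formatted dates).
latest = report[report["Date"] == report["Date"].max()]

# Drop the rollup row covering all age groups, if present.
latest = latest[latest["ObjectAge"] != "ALL"]

# Age groups that still receive GET requests are hot; the rest are candidates
# for transition to a lower-cost storage class.
columns = ["ObjectAge", "Storage_MB", "DataRetrieved_MB", "GetRequestCount"]
hot = latest.loc[latest["GetRequestCount"] > 0, columns]
cold = latest.loc[latest["GetRequestCount"] == 0, columns]

print("Actively accessed age groups:\n", hot)
print("Cold age groups (transition candidates):\n", cold)
print("Cold storage (GB):", round(cold["Storage_MB"].sum() / 1024, 1))
```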
Predictable workload example
In this section we look at an example of an S3 Storage Class Analysis report for a predictable workload pattern, as shown in Figure 6.
Figure 6: Sample S3 Storage Class Analysis report with predictable pattern
The ObjectAge column lists bucketized age groups; for example, 000-014 means objects aged between 0 and 14 days, 015-029 means objects aged between 15 and 29 days, and so on.
Storage_MB gives the total storage in each bucketized age group; for example, objects aged 000-014 use a total of 277,847 MB, objects aged 015-029 use a total of 290,350 MB, and so on.
GetRequestCount is the total number of GET API calls directed to each bucketized age group; in this example, objects aged 000-014 days received about 12,369 API calls, and the other age groups received none.
DataRetrieved_MB is the data transferred out in MB with GET requests for the day, per storage class and age group. For AgeGroup='ALL', the value is the overall data transferred out in MB with GET requests across all age groups for the day.
When analyzed, the preceding report (Figure 6) shows that the patterns are predictable. There is active usage for objects aged 0–14 days, and no API requests for older objects. Therefore, you can use an S3 Lifecycle policy to move the rarely accessed data to the S3 Glacier storage classes after 15 days.
Unpredictable workload example
Here we look at an example of an S3 Storage Class Analysis report for an unpredictable workload pattern.
The following S3 Storage Class Analysis report (Figure 7) shows active access for the first 30 days. For the next 30 days (that is, from day 30 to day 59), there is no access; the data is accessed again from day 60 to day 119, and there are no access requests from day 120 to day 179. There is no predictability here: some parts of the data are still accessed as they age, and some parts are not accessed at all. In this scenario, you would pay the S3 Standard charge for both the actively accessed and the infrequently or rarely accessed portions.
This bucket suits S3 Intelligent-Tiering, because it automatically moves the cold portion that is not accessed to the Infrequent Access tier after 30 consecutive days without access and to the Archive Instant Access tier after 90 consecutive days without access.
Figure 7: Sample S3 Storage Class Analysis report with unpredictable pattern
In our analysis, we checked the pattern for a specific day. In general, look at patterns over a period of time, for example a month, before drawing a conclusion about your access patterns. By the end of this phase, the bucket workload pattern is identified as either predictable or unpredictable.
3. Act
The last phase deals with acting on the outcome of the analysis phase with a clear indication of the access pattern.
Action for predictable patterns
For predictable patterns, enabling an S3 Lifecycle policy to move infrequently accessed data to S3 Glacier storage classes gives the maximum savings.
The choice between S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive depends on the bucket’s business use case. If data needs to be available with millisecond retrieval performance and is accessed one time per quarter, then S3 Glacier Instant Retrieval is the right choice. If data can be accessed asynchronously with retrieval times of minutes to hours and is accessed one time every six months, then S3 Glacier Flexible Retrieval is the right choice. If the use case allows for data availability within 24–48 hours and data is accessed one time yearly, then S3 Glacier Deep Archive is the right choice.
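As an illustration of the predictable case from Figure 6, the following boto3 sketch creates an S3 Lifecycle rule that transitions objects to S3 Glacier Deep Archive 15 days after creation. The bucket name and prefix are placeholders, and note that this call replaces any existing lifecycle configuration on the bucket, so include any rules you want to keep.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix; this call replaces the bucket's existing
# lifecycle configuration, so include any other rules you want to keep.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-source-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-data-after-15-days",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 15, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    },
)
```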
There is a one-time transition charge when data is moved through the S3 Lifecycle policy and data retrieval charges when accessed. The following table summarizes the charges for the US East (North Virginia) AWS Region.
Transition to | Retrieval option | S3 Lifecycle transition requests into (per 1,000 requests) | Data retrieval requests (per 1,000 requests) | Data retrievals (per GB)
S3 Intelligent-Tiering | n/a | $0.01 | n/a | n/a
S3 Standard-IA | n/a | $0.01 | n/a | $0.01
S3 Glacier Instant Retrieval | n/a | $0.02 | n/a | $0.03
S3 Glacier Flexible Retrieval | Expedited | $0.03 | $10.00 | $0.03
S3 Glacier Flexible Retrieval | Standard | $0.03 | $0.05 | $0.01
S3 Glacier Flexible Retrieval | Bulk | $0.03 | n/a | n/a
S3 Glacier Deep Archive | Standard | $0.05 | $0.10 | $0.02
S3 Glacier Deep Archive | Bulk | $0.05 | $0.025 | $0.0025
Table 1: Transition costs for US East (N. Virginia) Region
To calculate the one-time charges, you need the total number of cold objects to be moved, because the transition charges are per 1,000 objects, not by size.
Calculating one-time charges
Now we use the same example from Figure 6, where the patterns are predictable. In the current storage, the cold portion of the data (the sum of the bucketized groups beyond 15 days) is approximately 6.31 TB. Assume that this 6.31 TB consists of approximately one million objects and that users can wait 24–48 hours for the data, so the cold portion can be moved to S3 Glacier Deep Archive.
There is a one-time transition charge of $0.05 per 1,000 objects to move data from S3 Standard to S3 Glacier Deep Archive (Table 1).
With one million objects, the transition charges would be (1,000,000 / 1,000) × $0.05, which is $50.
An important aspect to consider for this transition is the object size. The transition charges are based on the number of objects, not the size; thus, the larger the objects, the lower the transition cost per GB of storage.
For example, say you have 10 TB of data. If you had 10 million small objects of 1 MB each, then the transition to S3 Glacier Deep Archive would cost $500, whereas if you had one million objects of 10 MB each, the transition would cost $50.
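To make the dependence on object count concrete, here is a small Python sketch of that calculation, using the $0.05 per 1,000 transition requests rate for S3 Glacier Deep Archive from Table 1.

```python
def transition_cost(object_count, cost_per_1000_requests=0.05):
    """One-time S3 Lifecycle transition charge to S3 Glacier Deep Archive."""
    return (object_count / 1000) * cost_per_1000_requests

# 10 TB stored as 10 million 1 MB objects vs. 1 million 10 MB objects.
print(transition_cost(10_000_000))  # $500.0
print(transition_cost(1_000_000))   # $50.0
```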
Effective savings
As an example, we can calculate the effective savings after one-time charges. From Figure 6, the data in the objects aged beyond 15 days is rarely accessed.
An S3 Lifecycle policy applied to objects aged more than 15 days moves a total of approximately 6.3 TB to cheaper storage classes. In this example, as mentioned previously, we use S3 Glacier Deep Archive as the destination storage class.
Current cold storage size | Current S3 Standard charges | One-time charges to move to S3 Glacier Deep Archive | Charges after moving to S3 Glacier Deep Archive* | Effective savings | Effective savings %
6471 GB | $323.55 | $50 | $6.40 ($0.00099 × 6471) | $323.55 - $50 = $273.55 | 84.54%
Table 2: Savings calculation for predictable access pattern
*$0.00099 is the S3 Glacier Deep Archive charge per GB in the US East (N. Virginia) Region.
Action for unpredictable patterns
For unpredictable patterns, S3 Intelligent-Tiering is cost effective, and you can use an S3 Lifecycle configuration to move data from the S3 Standard storage class to the S3 Intelligent-Tiering storage class.
With S3 Intelligent-Tiering, there is a small monitoring fee for the objects. Only objects larger than 128 KB are eligible for monitoring and are moved between the access tiers based on last access time. There is no monitoring charge for objects smaller than 128 KB, and S3 Lifecycle does not transition objects smaller than 128 KB to S3 Intelligent-Tiering.
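A minimal boto3 sketch for this case is shown below; the bucket name is a placeholder, and the ObjectSizeGreaterThan filter (in bytes) makes the 128 KB behavior explicit by keeping smaller objects out of the rule. As with the earlier lifecycle example, this call replaces the bucket's existing lifecycle configuration.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket; this call replaces any existing lifecycle configuration.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-unpredictable-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                # Only objects larger than 128 KB (131,072 bytes) are transitioned.
                "Filter": {"ObjectSizeGreaterThan": 131072},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```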
Calculating one-time charges
First, the entire bucket has an unpredictable workload. Therefore, the objects from the S3 Standard storage class should be moved to the S3 Intelligent-Tiering storage class.
There is a one-time transition charge of $0.01 per 1,000 objects to move data from S3 Standard to S3 Intelligent-Tiering (Table 1).
Assuming there is a total of 10 million objects greater than 128 KB in the bucket, the transition charges would be (10,000,000 / 1,000) × $0.01, which is $100.
Monitoring charges for S3 Intelligent-Tiering
Second, when the objects are moved from the S3 Standard storage class to the S3 Intelligent-Tiering storage class, monitoring charges would be applicable.
With 10 million objects greater than 128 KB in the bucket, the monitoring charges for the bucket would be (10,000,000 / 1,000) × $0.0025 = $25 per month.
*Monitoring charges are $0.0025 per 1,000 objects per month in the US East (N. Virginia) Region.
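A short Python sketch ties the two charges together for the 10 million object example, using the US East (N. Virginia) rates quoted above.

```python
objects_over_128kb = 10_000_000

# One-time S3 Lifecycle transition charge into S3 Intelligent-Tiering
# ($0.01 per 1,000 requests).
transition_cost = (objects_over_128kb / 1000) * 0.01       # $100.0

# Ongoing S3 Intelligent-Tiering monitoring charge
# ($0.0025 per 1,000 objects per month).
monthly_monitoring = (objects_over_128kb / 1000) * 0.0025  # $25.0 per month

print(f"One-time transition charge: ${transition_cost:.2f}")
print(f"Monthly monitoring charge:  ${monthly_monitoring:.2f}")
```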
Effective savings
As an example, we can look at the effective savings after one-time charges. From Figure 7, the data in the object age groups 30–44, 45–59, 120–149, and 150–179 is not accessed.
After 30 days
If S3 Intelligent-Tiering is enabled on the bucket, a total of approximately 5.82 TB is identified as cold 30 days after moving to S3 Intelligent-Tiering. This data is moved to the Infrequent Access tier after 30 consecutive days without access and to the Archive Instant Access tier after 90 consecutive days without access.
Current cold storage size | Current S3 Standard charges for cold portion of the data | One-time charges to move to S3 Intelligent-Tiering storage class | Charges after 30 days for cold portion with S3 Intelligent-Tiering storage class (Infrequent Access tier)* | Savings | Savings %
5820 GB | $137 | $100 | $0.0125 × 5820 = $72.8 | $137 - $72.8 = $64.2 | 46%
Table 3: Savings calculation for unpredictable pattern after 30 days
*$0.0125 is the S3 Intelligent-Tiering Infrequent Access tier charge per GB in the US East (N. Virginia) Region.
After 90 days
Sixty days after the objects are moved to the Infrequent Access tier, S3 Intelligent-Tiering automatically moves the Infrequent Access portion to the Archive Instant Access tier if it has still not been accessed, without any transition costs. This allows more savings for the cold portion of the data.
For clarity, this example only considers the 5,820 GB of infrequently accessed data moved to the Archive Instant Access tier. In general, however, S3 Intelligent-Tiering continues to monitor newly ingested objects, moving them to the Infrequent Access tier after 30 consecutive days without access and to the Archive Instant Access tier after 90 consecutive days without access.
Current Infrequent Access tier data | Current Infrequent Access tier charges | One-time charges to move to other storage classes | Charges after 120 days for cold storage with S3 Intelligent-Tiering storage class (Archive Instant Access tier) | Savings | Savings %
5820 GB | $64.2 | $0 | $0.004 × 5820 = $23.28 | $64.2 - $23.28 = $40.92 | 64%
Table 4: Savings calculation for unpredictable pattern after 90 days
Other tools to consider
To get a precise count of the objects eligible for transition to other storage classes, for either predictable or unpredictable access patterns, two key tools help: Amazon S3 Inventory and Amazon Athena.
Amazon S3 Inventory provides a report in either a CSV, Apache optimized row columnar (ORC), or Apache Parquet output file that lists your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket. Athena is a serverless, interactive analytics service built on open source frameworks that supports an open-table and file format. Athena integrates tightly with Amazon S3 and allows you to analyze an Amazon S3 inventory report.
The blog, “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena”, explains the Athena and Amazon S3 Inventory report integration and how to query the Amazon S3 Inventory metadata. By querying the Amazon S3 Inventory report, you can derive insights such as the number of objects greater than 128 KB that are eligible for transition, or objects older than a certain age. This allows for more precise calculations of one-time costs for S3 Lifecycle actions and monitoring charges for S3 Intelligent-Tiering.
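As a rough illustration, the following sketch runs such a query through the Athena API with boto3. The database, table, and output location are hypothetical names that would come from your own Amazon S3 Inventory setup, and the column names (size, last_modified_date) assume the standard inventory schema described in the referenced blog post.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and output location created for the
# S3 Inventory report, as described in the referenced blog post.
query = """
    SELECT COUNT(*) AS eligible_objects,
           SUM(size) * 1.0 / 1024 / 1024 / 1024 AS eligible_gb
    FROM s3_inventory.my_bucket_inventory
    WHERE size > 131072
      AND last_modified_date < date_add('day', -30, current_date)
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "s3_inventory"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
```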
Cleaning up
When the access patterns are derived, the Amazon S3 Storage Class Analysis report can be stopped to avoid ongoing charges. The following steps stop the report:
1. In the Amazon S3 console, choose Buckets. Select the Bucket that needs the S3 Storage Class Analysis report to be disabled. Go to the Metrics tab and choose Storage Class Analysis.
Figure 8: Navigating to clean up S3 Storage Class Analysis report
2. Select the report that needs to be deleted, and choose Delete.
Figure 9: Delete Storage Class Analysis report
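If you created the configuration with the SDK, you can also remove it the same way. This is a minimal boto3 sketch using the placeholder bucket name and configuration ID from the earlier example.

```python
import boto3

s3 = boto3.client("s3")

# Same placeholder bucket and configuration ID used when the report was enabled.
s3.delete_bucket_analytics_configuration(
    Bucket="my-analytics-source-bucket",
    Id="cold-data-analysis",
)
```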
Conclusion
In this post, we explored how to configure the S3 Storage Class Analysis report and the importance of analyzing access patterns across large Amazon S3 buckets to optimize costs. We identified scenarios where moving infrequently accessed data to more cost-effective Amazon S3 storage classes using S3 Lifecycle rules or Amazon S3 Intelligent-Tiering can result in significant savings. We showcased use cases for S3 Lifecycle policies and S3 Intelligent-Tiering, using tools such as S3 Storage Lens and Storage Class Analysis reports. We also calculated savings and one-time transition charges to guide the decision-making process.
Amazon S3 provides storage classes and the S3 Intelligent-Tiering storage class to optimize costs based on your workload and business needs. Understanding the access patterns of your S3 buckets is crucial for capturing these optimization opportunities. The S3 Storage Class Analysis report helps identify access patterns by analyzing your buckets, enabling you to place your data in the most cost-effective storage class.
Thank you for reading this post. If you have any questions or comments, leave them in the comments section.
For additional resources, refer to the video on enabling Storage Class Analysis and other relevant storage blogs.