Optimize storage cost for your Athena queries

You can use Amazon Athena, a lightweight, serverless analytics tool, to query your AWS Cost and Usage Report (CUR). This enables you to dive into your cost and usage data for spend reporting and optimization analysis. However, you may not be aware of all the ways you can optimize your Amazon S3 costs by taking advantage of S3 Lifecycle configuration.

Why do you need to optimize your S3 costs for Athena queries?

Put simply, when you run Amazon Athena queries on your CUR, the results are stored in Amazon S3 and you pay to store them there. Many customers overlook the importance of setting up the correct configuration to remove unused results, and therefore waste money. In this blog, we will establish a cost-optimal Amazon S3 bucket configuration for storing Amazon Athena query results.

Where are Amazon Athena query results stored?

To use Amazon Athena, you must configure an Amazon S3 bucket to store your query results as CSV files. Within the AWS Console, select Manage settings on the Athena page. Figure 1 shows the configuration box where you can choose the location of query results and enter the path to the bucket that you created in Amazon S3 for your query results. Prefix the path with s3://.

Figure 1. Athena Manage Settings View

In each Region where you run Amazon Athena, you must specify an S3 bucket for query results in that same Region. We recommend creating a dedicated bucket for these results. This setup centralizes your results and ensures that the configuration we set only affects Athena results. For the full setup guide, see the getting started guide for Amazon Athena.

Why are Amazon Athena results stored in Amazon S3?

Athena stores results in Amazon S3 because the API is asynchronous. Queries can take time to run, so the API is designed so that you start a query with one call, check its status with another call, and consume the results with a third. With this model, no connection stays open waiting for the query to complete, so the service needs to store the results somewhere. That’s where Amazon S3 comes in. This is also where a potential issue arises: objects stored in Amazon S3 remain unused, yet continue to incur costs.

What is an Amazon S3 Lifecycle configuration?

Amazon S3 natively offers a simple way for all customers to cost-effectively manage their S3 objects via Amazon S3 Lifecycle configurations. An S3 Lifecycle configuration is a set of rules that define the actions Amazon S3 applies to a group of objects within a single bucket. There are two types of actions:

Transition actions – These actions define when objects transition to another storage class.
Expiration actions – These actions define when objects expire. Amazon S3 deletes expired objects on your behalf.

We will use both of these for your Amazon S3 bucket.

What lifecycle configuration should you add to your Amazon Athena bucket?

We will add two lifecycle rules to our bucket. One moves old data to cost-efficient storage classes and expires it, and the other removes incomplete multipart uploads.

Transition and delete your old query results

We find that customers rarely access query results more than once after creation. New Athena query results are written to the S3 Standard storage class, which offers high durability, availability, and performance for a wide range of use cases. When query results are retained for longer than two months, customers can optimize costs by transitioning larger objects into cost-efficient S3 storage classes, which offer a lower cost per gigabyte while maintaining the same performance as the S3 Standard storage class. We used the Performance across the S3 storage classes table for this comparison. Size filters are also important: because of the per-object cost of lifecycle transitions, it is more cost effective to keep smaller objects in S3 Standard. For recommendations on cost-effective archiving, please see the S3 CFM tips playbook. Therefore, we have three options for your lifecycle configuration. These will give you the best price based on how long you want to store these objects.

If keeping for less than a month – Expire them at the day number you need.
If keeping for at least 2 months and at least 1 MB in size – Keep them for 30 days in S3 Standard, then at least 30 days in S3 Standard-Infrequent Access, and expire them at two months or later.
If keeping for 3 months or longer and at least 512 KB in size – Transition them to S3 Glacier Instant Retrieval at the day number you need and expire them at three months or later.

More often than not, the results are no longer needed after one day, so they can be expired. If results later need to be shared and the query result has expired, you can simply re-run the query. The AWS CloudFormation LifecycleConfiguration snippet below expires (deletes) objects after 30 days. While we use 30 days in the code snippets, you can change the time period to match your retention needs. Please check whether you are required to keep this data.

Figure 2. S3 lifecycle configuration in console for option 3

You can also use the code snippet below in your AWS CloudFormation template:

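S3Bucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      LifecycleConfiguration:
        Rules:
        # Expire (delete) every object 30 days after it was created.
        # Adjust the number of days to match your retention needs.
        - Id: LifecycleRule
          Status: Enabled
          ExpirationInDays: 30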
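
The snippet above covers option 1. As a rough sketch, options 2 and 3 could be expressed with size filters and transitions as shown below. The logical ID, rule names, and day counts are illustrative assumptions, 1 MB and 512 KB are assumed to mean 1,048,576 and 524,288 bytes, and the option 3 rule is left disabled so that only one option is active at a time. These rules would be used instead of the plain 30-day expiry rule, not alongside it.

```yaml
AthenaResultsBucket:                        # hypothetical logical ID
    Type: 'AWS::S3::Bucket'
    Properties:
      LifecycleConfiguration:
        Rules:
        # Option 2: 30 days in S3 Standard, 30 days in Standard-IA, then expire.
        - Id: option2-standard-ia           # hypothetical rule name
          Status: Enabled
          ObjectSizeGreaterThan: '1048575'  # bytes, strictly greater than, so 1 MiB and up
          Transitions:
          - StorageClass: STANDARD_IA
            TransitionInDays: 30
          ExpirationInDays: 60
        # Objects under the size threshold stay in S3 Standard and are only expired.
        - Id: option2-expire-small          # hypothetical rule name
          Status: Enabled
          ObjectSizeLessThan: '1048576'     # bytes, strictly less than
          ExpirationInDays: 60
        # Option 3: S3 Glacier Instant Retrieval for results kept 3 months or longer.
        - Id: option3-glacier-ir            # hypothetical rule name
          Status: Disabled                  # enable this instead of the option 2 rules
          ObjectSizeGreaterThan: '524287'   # bytes, so 512 KiB and up
          Transitions:
          - StorageClass: GLACIER_IR
            TransitionInDays: 1             # assumed; use the day number you need
          ExpirationInDays: 91              # assumed; GLACIER_IR has a 90-day minimum storage duration
```

A rule with a size filter applies all of its actions, including expiration, only to matching objects. That is why the sketch pairs the option 2 transition rule with a separate expiry rule for smaller objects. Option 3 would need an equivalent companion rule.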

Remove incomplete multipart uploads

The second configuration removes incomplete multipart uploads (IMPUs). Amazon S3’s multipart upload feature allows you to upload a single object to an S3 bucket as a set of parts, providing benefits such as improved throughput and quick recovery from network issues. However, if the upload process is interrupted, Amazon S3 will not assemble the parts. If this happens, your object will not be accessible, but you will still pay for the parts that are stored in Amazon S3. But don’t worry: with the lifecycle rule below, IMPUs are removed seven days after the upload was initiated. Read more about them in the blog Discovering and Deleting Incomplete Multipart Uploads. This configuration is a best practice to apply to all of your S3 buckets.

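S3Bucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      LifecycleConfiguration:
        Rules:
        - Id: delete-incomplete-mpu
          # Abort uploads still incomplete 7 days after they started
          # and delete their stored parts.
          AbortIncompleteMultipartUpload:
            DaysAfterInitiation: 7
          # Remove delete markers that have no noncurrent versions left
          # (only relevant when S3 Versioning is enabled).
          ExpiredObjectDeleteMarker: true
          Status: Enabled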
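
Both snippets above set LifecycleConfiguration on the same bucket resource, so in a real template the two rules would sit together in a single Rules list. A minimal sketch of the merged template follows. The logical ID and the first rule name are hypothetical, and the day counts are the ones used in this blog.

```yaml
Resources:
  AthenaResultsBucket:                  # hypothetical logical ID
    Type: 'AWS::S3::Bucket'
    Properties:
      LifecycleConfiguration:
        Rules:
        # Rule 1: expire query results after 30 days.
        - Id: expire-query-results      # hypothetical rule name
          Status: Enabled
          ExpirationInDays: 30
        # Rule 2: clean up incomplete multipart uploads after 7 days.
        - Id: delete-incomplete-mpu
          Status: Enabled
          AbortIncompleteMultipartUpload:
            DaysAfterInitiation: 7
          ExpiredObjectDeleteMarker: true
```

ExpiredObjectDeleteMarker is kept in the second rule because Amazon S3 does not allow it in the same rule as a days-based expiration.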

Why should you add a lifecycle configuration to your Athena Query bucket?

We often see customers spend money on query results they never use. Figure 3 below shows the impact of having these lifecycle configurations on your buckets. STOP WASTING MONEY.

Figure 3. Impact of deleting IMPUs on Athena bucket costs

How to see if you have IMPUs

There are two simple ways to see if your bucket has IMPUs. These are Amazon S3 Storage Lens and the CUDOS Dashboard. You can also use these tools to understand how applying lifecycle configuration affects your storage over time.

Find incomplete multipart uploads with S3 Storage Lens

Amazon S3 Storage Lens is a cloud-storage analytics tool for organization-wide insights into object-storage usage and activity. Figure 4 shows the Amazon S3 Storage Lens metrics dropdown, highlighting Incomplete, with the metric for Incomplete MPU Objects. You can either filter for your specific Athena bucket, or you can use the Trends and distribution visual in Figure 5 to see which buckets have any IMPUs. Then you can apply your lifecycle configuration not only to your Athena results buckets but to any that have IMPUs. See more in this blog on how to use S3 Storage Lens for Optimization.

Figure 4. Amazon S3 Storage Lens Metric Dropdown

Figure 5. Amazon S3 Storage Lens Trends and distribution visual

Find incomplete multipart uploads on CUDOS

The second option to find IMPUs is the CUDOS dashboard. CUDOS is one of the Cloud Intelligence Dashboards. It uses Amazon QuickSight and is built on the AWS Cost and Usage Report. The Amazon S3 tab shows buckets with IMPU data at the very bottom of the sheet. Figure 6 shows buckets where you have incomplete MPUs. These buckets are the first candidates to check in detail in S3 Storage Lens and to receive a lifecycle configuration for IMPU deletion.

Figure 6. CUDOS Amazon S3 Tab

Advice for FinOps teams

In summary, the goal of this blog is for you to take ownership of the results you generate from your Amazon Athena queries. As a FinOps practitioner, you are often the one driving cost optimization in your organization. Optimizing your own workloads is a great way to show teams what ‘good’ looks like. The success from your optimization should be shared! Your own S3 best practices can help scale learnings and drive company-wide optimization goals.

Note:

We recommend enabling S3 Versioning on your S3 buckets to help you recover objects from accidental deletion or overwrite. When versioning is enabled, lifecycle expiry rules add a delete marker as the current version rather than permanently deleting the object. You should therefore add lifecycle rules that permanently delete noncurrent object versions after your desired retention period, as well as expired object delete markers.
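
As a sketch of what this note describes, a versioned results bucket might pair the expiry rule with a second rule that cleans up noncurrent versions and expired object delete markers. The logical ID, rule names, and the 7-day noncurrent retention are assumptions for illustration. Choose a retention period that matches your own recovery needs.

```yaml
AthenaResultsBucket:                    # hypothetical logical ID
    Type: 'AWS::S3::Bucket'
    Properties:
      VersioningConfiguration:
        Status: Enabled
      LifecycleConfiguration:
        Rules:
        # On a versioned bucket this adds a delete marker as the current
        # version; the previous version becomes noncurrent.
        - Id: expire-current-versions   # hypothetical rule name
          Status: Enabled
          ExpirationInDays: 30
        # Permanently delete noncurrent versions, then remove delete
        # markers that no longer have any versions behind them.
        - Id: clean-up-versions         # hypothetical rule name
          Status: Enabled
          NoncurrentVersionExpiration:
            NoncurrentDays: 7           # assumed retention period
          ExpiredObjectDeleteMarker: true
```

Without the second rule, expired results would remain billable as noncurrent versions.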

AWS Cloud Financial Management