AWS Storage Blog
Access a point in time with Amazon S3 Object Lambda
Point-in-time ‘snapshots’ enable administrators, developers, testers, and end users to quickly access a storage volume or share as it was at an earlier point-in-time. They are a longstanding approach to data protection and recovery, tracking changes within a storage system to reduce both Recovery Point Objective (RPO) and Recovery Time Objective (RTO). However, traditional snapshots have inherent limitations on scale, and aren’t suitable for billions of objects and petabytes of data in object storage.
Amazon S3 is an object storage service with industry-leading scalability, data availability, security, performance, and 99.999999999% (11 9s) of data durability. In the shared responsibility model, customers are responsible for configuring Amazon S3 to meet their security and compliance objectives. S3 Versioning protects against accidental deletions and overwrites by keeping multiple variants of an object in the same S3 bucket. Data deletion remains a risk even when using Open Table Format abstractions with their own ‘time travel’ capabilities. S3 Object Lock works together with S3 Versioning to protect data against malicious activity, such as ransomware events, by preventing the deletion of specific object versions.
In this blog, we provide a snapshot-like access and recovery solution for Amazon S3. Deployed through AWS CloudFormation, it uses S3 Object Lambda and S3 Versioning to quickly and safely provide one or more read-only, point-in-time views of the data in your S3 bucket. As the solution is read-only, it helps eliminate the risk of changes to your live dataset when accessing a non-current state. In addition, we discuss how you can extend the same solution to streamline and accelerate restore of an entire bucket, as it was at a point-in-time, into a new bucket, for use cases where write access is required. This solution enables many use cases such as data restore, application testing, data analysis, and disaster recovery (DR) testing, without impacting the usual requests to the bucket.
Point-in-time ‘snapshots’ of an S3 bucket, using colors to indicate different versions of keys a, b, and c, and red Xs to indicate objects that do not exist or have a delete marker as their current version. The current versions from each time are highlighted.
Solution overview
This solution enables restoration of a dataset stored in a general purpose S3 bucket or prefix, as it was at a specified point-in-time, by using existing tools and API calls to establish read-only access to the dataset.
The desired versions of objects must still exist. Users can prevent the deletion of specific versions by denying use of the DeleteObjectVersion API with bucket policies, access point policies, or service control policies (SCPs), and/or by using S3 Object Lock. For a complementary solution to assist with this, read “Maintaining object immutability by automatically extending Amazon S3 Object Lock retention periods.”
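As an illustration of the bucket policy option, the following minimal sketch builds a policy document that denies the s3:DeleteObjectVersion action on every object in a bucket. The function name and the example bucket name are hypothetical, the ARN assumes the standard aws partition, and the statement applies to all principals, so you would likely want to add conditions or exceptions before using anything like it.

```python
import json


def deny_version_deletion_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies s3:DeleteObjectVersion on every object.

    Illustrative only: the Deny applies to all principals, including
    administrators, and would need tailoring for a real bucket.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeleteObjectVersion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:DeleteObjectVersion",
                # Assumes the standard "aws" partition.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "amzn-s3-demo-bucket" is a placeholder bucket name.
    print(json.dumps(deny_version_deletion_policy("amzn-s3-demo-bucket"), indent=2))
```

The resulting JSON could be applied as a bucket policy. S3 Object Lock remains the stronger control where versions must be immutable.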
With S3 Object Lambda, you can add your own code to S3 GET, HEAD, and LIST requests to modify and process data as it is returned to an application. You can use custom code to modify the data returned by S3 GET requests to filter rows, dynamically resize images, redact confidential data, and much more. This solution modifies the data returned by these operations to transparently return to the client only the versions of objects that were current at the point-in-time specified during solution deployment. In other words, clients observe the bucket precisely as it would have appeared at the point-in-time specified.
Solution architecture, showing the components deployed by the CloudFormation stack and the workflows
The letters in the preceding diagram indicate the solution deployment steps:
a. The deployment automatically selects the most recent suitable S3 Inventory. The S3 Inventory is processed with Amazon Athena to build a point-in-time index of object key names and their version IDs.
b. The point-in-time index is stored in Amazon DynamoDB.
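The selection logic behind a point-in-time index can be sketched in a few lines of Python. This is not the solution's Athena query. It is a simplified illustration that assumes each inventory row carries a key, version ID, last-modified time, and delete-marker flag, and the names InventoryRow and build_point_in_time_index are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, NamedTuple


class InventoryRow(NamedTuple):
    """One object version, reduced to the fields this sketch needs."""
    key: str
    version_id: str
    last_modified: datetime
    is_delete_marker: bool


def build_point_in_time_index(
    rows: Iterable[InventoryRow], point_in_time: datetime
) -> Dict[str, str]:
    """Map each key to the version ID that was current at point_in_time."""
    latest: Dict[str, InventoryRow] = {}
    for row in rows:
        # Versions written after the point in time did not exist yet.
        if row.last_modified > point_in_time:
            continue
        best = latest.get(row.key)
        if best is None or row.last_modified > best.last_modified:
            latest[row.key] = row
    # A key whose current version was a delete marker is absent from the view.
    return {
        key: row.version_id
        for key, row in latest.items()
        if not row.is_delete_marker
    }
```

The sketch ignores ties between versions with identical timestamps, and it can only see versions that still exist in the inventory, which is why the desired versions must not have been permanently deleted.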
The numbers in the preceding diagram indicate the route followed by each Amazon S3 request:
1. Your client issues S3 GET, HEAD, and LIST requests as usual to the S3 Object Lambda Access Point’s bucket-style alias or Amazon Resource Name (ARN).
2. S3 Object Lambda invokes an AWS Lambda function specific to the type of request.
3. The Lambda function queries the point-in-time index in DynamoDB.
- For LIST operations, the results are returned to the client through the S3 Object Lambda Access Point.
- For GET and HEAD operations, if the request is for an object that doesn’t exist in the point-in-time index (because the object didn’t exist or the current version was a delete marker), a 404 (Not found) error is returned to the client through the S3 Object Lambda Access Point.
4. For GET and HEAD requests where the object does exist in the point-in-time index in DynamoDB, Lambda requests data from the specific version ID from the S3 bucket (through the S3 Access Point) and returns this to the client.
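The GET path in steps 3 and 4 can be sketched as a Lambda handler. This is a simplified illustration rather than the solution's code: the DynamoDB query is abstracted behind a hypothetical lookup_version callable, the S3 client is passed in (in a real function it would be a boto3 S3 client), the object key is assumed to be the URL path of the user request, and the whole object is buffered in memory for brevity.

```python
from typing import Callable, Optional
from urllib.parse import unquote, urlparse


def handle_get_object(
    event: dict,
    s3_client,
    lookup_version: Callable[[str], Optional[str]],
) -> dict:
    """Serve a GET request from the point-in-time index.

    lookup_version is a hypothetical stand-in for the DynamoDB query:
    it returns the version ID for a key, or None if the key is absent.
    """
    context = event["getObjectContext"]
    # Assumption: the request path is "/<url-encoded key>".
    key = unquote(urlparse(event["userRequest"]["url"]).path[1:])
    version_id = lookup_version(key)

    if version_id is None:
        # The key did not exist, or its current version was a delete marker.
        s3_client.write_get_object_response(
            RequestRoute=context["outputRoute"],
            RequestToken=context["outputToken"],
            StatusCode=404,
            ErrorCode="NoSuchKey",
            ErrorMessage="The key did not exist at the point in time.",
        )
        return {"statusCode": 404}

    # Read the specific version through the supporting access point.
    source = s3_client.get_object(
        Bucket=event["configuration"]["supportingAccessPointArn"],
        Key=key,
        VersionId=version_id,
    )
    s3_client.write_get_object_response(
        RequestRoute=context["outputRoute"],
        RequestToken=context["outputToken"],
        Body=source["Body"].read(),  # buffers the object; fine for a sketch
    )
    return {"statusCode": 200}
```

Passing the client and the lookup as parameters keeps the routing decision testable without AWS credentials. HEAD and LIST requests use different event contexts and are not covered by this sketch.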
Prerequisites
In this section we cover what you need to get started.
- An S3 bucket with S3 Versioning enabled.
- An S3 Inventory for this bucket in Parquet or Apache ORC format that includes all versions as well as all additional metadata.
- We recommend configuring S3 Inventory to create a new output for your buckets daily, because this minimizes the waiting period before new points in time are available to be examined with this solution. It might take up to 48 hours for S3 to deliver the first S3 Inventory report, and inventories might not include recently added or deleted objects.
- If you are creating a single inventory for a one-off use case, then you can disable or delete the configuration after the initial delivery.
- AWS Identity and Access Management (IAM) policies to accommodate the point-in-time access point. The following four resources (which are referenced in the CloudFormation stack view) must have permissions granted to work with Object Lambda Access Points. You can visit the documentation on Configuring IAM policies for Object Lambda Access Points to observe example policies and learn more.
- The IAM identity, such as user or role, which accesses the S3 Object Lambda Access Point.
- The S3 bucket and its associated standard access point (known as a supporting access point).
- The Object Lambda Access Point itself.
- The Lambda function execution role. To access objects encrypted with AWS Key Management Service (AWS KMS) (SSE-KMS) keys through this solution, this role also needs the kms:Decrypt IAM permission.
Walkthrough
In this walkthrough, we deploy a CloudFormation template then go through three examples of accessing a point in time.
1. Deploy the AWS CloudFormation Template
Download this CloudFormation template. Create a CloudFormation stack using this template in the same AWS Region as your S3 bucket. The stack parameters, shown in the following figure, are:
- BucketName: The source bucket on which you want to create the point-in-time view.
- Delimiter (if other than ‘/’): If you issue a LIST request with a delimiter, you can browse your hierarchy at only one level, skipping over and summarizing the (possibly millions of) keys nested at deeper levels. To support delimiters, the solution must process objects on this basis and store common prefixes in the point-in-time index. If you need the use of LIST operations with a delimiter other than ‘/’, then define it here. Only one delimiter can be defined per deployment.
- ManifestOnly (defaults to false): Change to true if you only wish to follow ‘Fully recreating a point-in-time’, and do not want to also create an Object Lambda Access Point.
- Prefix (optional): If you only want to process objects starting with a particular prefix, then enter it here.
- TimeStamp: The point-in-time that you want the access point to expose, in the ISO 8601 format <yyyy-mm-dd>T<hh:mm:ss>. For example: 2024-08-30T02:00:00.
Screenshot of CloudFormation template parameters
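To make the Delimiter parameter more concrete, the following sketch shows how a LIST request with a delimiter rolls deeper keys up into common prefixes. It illustrates the listing semantics only. The solution precomputes common prefixes when the index is built, and the function name list_one_level is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def list_one_level(
    keys: Iterable[str], prefix: str = "", delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Return (contents, common_prefixes) for one level of the hierarchy."""
    contents: List[str] = []
    common_prefixes: Set[str] = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        position = remainder.find(delimiter)
        if position == -1:
            # No delimiter after the prefix: the key is listed directly.
            contents.append(key)
        else:
            # Everything up to and including the first delimiter is
            # summarized as a single common prefix.
            common_prefixes.add(prefix + remainder[: position + len(delimiter)])
    return sorted(contents), sorted(common_prefixes)
```

For example, with the keys a.txt, cats/cat-1.jpg, and cats/cat-2.jpg and the default delimiter, the call returns a.txt as the only object and cats/ as the only common prefix. Because the grouping depends on the delimiter, an index built for one delimiter cannot answer LIST requests that use another.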
When deployed, this CloudFormation template creates the following resources:
- An S3 bucket to store Athena output (known as the solution bucket).
- An AWS Glue database and Athena tables.
- IAM roles for use by Athena and Lambda.
- Lambda functions, to manage the index creation for the Object Lambda Access Point and to cleanly remove the data from the solution bucket when the CloudFormation stack is deleted.
- DynamoDB tables holding the object keys, metadata, and common prefixes.
- An S3 Access Point to the source bucket.
- An S3 Object Lambda Access Point using the preceding S3 Access Point and Lambda functions. Its ARN and alias are provided as outputs. View these in the Outputs tab of your CloudFormation stack.
- One or more manifest files, discussed in the section titled Fully recreating a point-in-time.
2. Accessing a point-in-time
The following examples show how to access a point-in-time.
Example 1: Comparing points in time, using the AWS Command Line Interface
Here we use the AWS Command Line Interface (AWS CLI) s3api list-objects-v2 command to list the keys in an S3 bucket that do not contain a ‘/’ delimiter. You can observe three objects, and for two of these the current versions were written on 2023-12-31, as shown in the following image.
Listing the original S3 bucket using the AWS CLI
Then we run the same command against the S3 Object Lambda Access Point created by our solution, for a point-in-time during 2023-12-30, as shown in the following image.
Listing the S3 Object Lambda Access Point using the AWS CLI
The output format is identical. However, now we observe the objects as they were on 2023-12-30. If we were to run a get-object command for one of these objects, then the response would also be as it was on 2023-12-30.
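Because both listings share the same format, comparing them can be automated. The sketch below assumes each listing is a list of dictionaries with Key and LastModified entries, shaped like the Contents array returned by list-objects-v2. The function name compare_listings and the result labels are hypothetical.

```python
from typing import Dict, List


def compare_listings(current: List[dict], earlier: List[dict]) -> Dict[str, List[str]]:
    """Summarize how the bucket differs between now and the point in time."""
    now = {item["Key"]: item["LastModified"] for item in current}
    then = {item["Key"]: item["LastModified"] for item in earlier}
    return {
        # Keys created after the point in time.
        "only_in_current": sorted(now.keys() - then.keys()),
        # Keys that existed at the point in time but have since been deleted.
        "only_at_point_in_time": sorted(then.keys() - now.keys()),
        # Keys present in both views whose current version has changed.
        "modified_since": sorted(
            key for key in now.keys() & then.keys() if now[key] != then[key]
        ),
    }
```

The first argument would come from listing the source bucket, and the second from listing the S3 Object Lambda Access Point.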
Example 2: Reading objects from a point in time, using Mountpoint for Amazon S3
In the following image, we have used Mountpoint for Amazon S3 to mount the same S3 bucket as in the prior example to /mnt/S3, and used ls -lR to list its contents.
Listing the original S3 bucket using Mountpoint for Amazon S3
In the following image, we have mounted the S3 Object Lambda Access Point to /mnt/OLAP. We see the same S3 bucket, but as it was at our chosen point in time. Three deleted objects (in the /cats folder) are again visible, and three objects have an older modification time.
Listing the S3 Object Lambda Access Point using Mountpoint for Amazon S3
We could now copy cat-3.jpg, or our desired version of cat-6.jpg, with regular file system commands.
Example 3: Fully recreating a point-in-time
Sometimes there is a need to create an independent and fully operational copy of a bucket (or prefix) as it was at a point-in-time, perhaps for application testing, following malicious activity, or because of the considerations listed later in this post. Users can achieve this by copying only the desired object versions to a new S3 bucket.
After creating a bucket in the same or another account, use S3 Batch Operations to perform a copy from the source bucket to the new bucket, using the manifest CSV file created when the solution was deployed. The location of this file is given by the ManifestLocation key in the Outputs tab of the CloudFormation console. When deploying the CloudFormation template, you can also choose to set ManifestOnly to true. This creates only the manifest, without the Object Lambda Access Point. Only objects that existed and were the current version at the specified point-in-time are copied. If you have objects larger than 5 GB, then an additional manifest, ManifestLocation5GBplus, is created. Use this manifest with the solution in the blog, “Copying objects greater than 5 GB with Amazon S3 Batch Operations.”
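To show what such manifests contain, the following sketch writes S3 Batch Operations style CSV rows (bucket, URL-encoded key, version ID) and routes objects above a size threshold into a second manifest. It is an illustration, not the solution's implementation: the function name build_manifests is hypothetical, the exact byte threshold (5 × 1024³) is an assumption, and the solution's own key encoding may differ.

```python
import csv
import io
from typing import Iterable, Tuple
from urllib.parse import quote

# Assumed threshold in bytes for "larger than 5 GB".
FIVE_GB = 5 * 1024 ** 3


def build_manifests(
    bucket: str, objects: Iterable[Tuple[str, str, int]]
) -> Tuple[str, str]:
    """Return (standard_manifest, large_object_manifest) as CSV text.

    Each item in objects is (key, version_id, size_in_bytes).
    """
    standard = io.StringIO()
    large = io.StringIO()
    standard_writer = csv.writer(standard, lineterminator="\n")
    large_writer = csv.writer(large, lineterminator="\n")
    for key, version_id, size in objects:
        writer = large_writer if size > FIVE_GB else standard_writer
        # Keys are URL-encoded in the manifest; "/" is left as-is.
        writer.writerow([bucket, quote(key, safe="/"), version_id])
    return standard.getvalue(), large.getvalue()
```

Including the version ID in each row is what lets the copy job pick the version that was current at the point-in-time, rather than the latest version of the key.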
You may also have a desire to ‘roll back’ the source bucket to a point-in-time. You can achieve this by deleting specific versions of objects that were overwritten since that time, as shown in the S3 User Guide. However, this may not be possible due to S3 Object Lock being present on newer versions of keys. If so, then it may be necessary to copy the desired versions of objects to the same bucket, creating newer (and thus current) versions. In addition, we often hear from users that write access is not allowed during investigations into malicious activity. These complexities mean that we have not included ‘roll back’ functionality in this solution.
Considerations
Consider the following when using this solution:
- There must be an S3 Inventory more recent than the desired point in time. Note that S3 Inventory lists are eventually consistent, and they might not include recently added or deleted objects.
- If you have a requirement to be able to revert an S3 bucket to a point-in-time, or track more recent changes, then you may consider the complementary solution described in the blog “Point-in-time restore for Amazon S3 buckets.” Reverting a bucket can be useful when only a small percentage of your data has changed, recovery time is a high priority, and there are no constraints on making changes to the original bucket.
- If you have backups of your S3 bucket with AWS Backup or a similar service from an AWS Partner, it may be preferable to use that service to recover your data – either recover object versions into the original bucket or fully recreate a point-in-time in a new bucket.
- When using S3 Object Lambda, follow these best practices and guidelines to optimize operations and performance, and go through this Amazon S3 user guide on using a bucket-style alias for your S3 bucket access point.
- Requests with a version ID parameter, or other versioning-specific requests, such as GetBucketVersioning or ListObjectVersions, are not supported.
- LIST requests with the --delimiter parameter must use the same delimiter character specified during solution deployment.
- PUT requests are denied because the solution is intentionally read-only.
Charges
The charges for deploying this solution are minimal, and primarily from DynamoDB import and storage. Requests incur charges for Lambda and DynamoDB processing, S3 Object Lambda data transfer, and the cost of requests to the underlying data in Amazon S3.
For example, if you deploy the solution against an entire bucket containing 10 million objects and with an existing Amazon S3 Inventory, perform 1000 LIST operations, 1000 GET operations of 1 MB each on average, and delete the CloudFormation stack after one week, the total cost would be approximately $1 (excluding other data transfer costs).
Cleaning up
To avoid incurring future charges, remove the solution when it is no longer needed. Deleting the CloudFormation stack removes the associated Lambda functions, supporting S3 Access Point, S3 Object Lambda Access Point, and DynamoDB and Athena tables. It also removes the Athena output from the solution bucket, and deletes the solution bucket.
You may also want to disable or delete S3 Inventory configurations if they are no longer needed.
Conclusion
Traditional snapshots have inherent limitations on scale, and aren’t suitable for billions of objects and petabytes of data in object storage. In this blog, we introduced a snapshot-like access and recovery mechanism for Amazon S3 using S3 Versioning and S3 Object Lambda. Use this solution to quickly and safely provide one or more read-only, point-in-time views of the data in your S3 bucket, and, if needed, to create a full copy of any of those views in a new S3 bucket.
Access these point-in-time views using S3 GET, HEAD, and LIST requests. This solution modifies the data returned by these operations to transparently return to the client only the versions of objects that were current at the point-in-time specified during solution deployment. In other words, S3 clients observe the bucket precisely as it would have appeared at the point-in-time specified.
Thank you for reading, and for experimenting with our solution. If you have comments or questions, feel free to leave them in the comments section, or create an issue on the GitHub repository.