Capture data changes while restoring an Amazon DynamoDB table

Amazon DynamoDB is a serverless, NoSQL, fully managed database service that delivers single-digit millisecond latency at any scale. Although the point-in-time recovery (PITR) feature in DynamoDB provides a safety net to protect against data loss, running the restoration process can be challenging, especially in production environments. The manual steps involved, such as determining the restore time, redirecting write operations, and updating application configurations, introduce risk and potential downtime, which can be unacceptable for critical applications.

This is the first post of a series dedicated to table restores and data integrity. In this post, we present a solution that automates the PITR restoration process and handles data changes that occur during the restoration, providing a fluid transition back to the restored DynamoDB table with near-zero downtime. This solution enables you to restore a DynamoDB table efficiently with minimal impact on your application.

Benefits of PITR

The need for data reliability, quick recovery, and minimal downtime has become the norm across various industries. Automating the PITR restore process helps you minimize service disruptions during a restore. An automated PITR solution provides not just data recovery, but also increased business continuity, data integrity, and operational efficiency. By automating the steps involved in restoring a table using PITR, organizations can respond rapidly to data issues, minimize downtime, and maintain the trust of their users and customers.

Alternatives to PITR

You can also use data modeling techniques such as version numbers and optimistic locking to make sure your table items point to the correct metadata version, reducing the blast radius of a bad deployment. With version numbers, you keep the previous metadata for a period you define. If an incorrect application deployment occurs, you must identify the affected items, determine the correct metadata value, and update the current value with the accurate version. However, what happens if the deployment changed several versions of the same item? How would you determine which one is the right version? If your version numbers are timestamps, the solution could be simpler, but what if you are using plain numbers or hashes for version control?
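As a rough illustration of optimistic locking with a version number, the following Python sketch builds the arguments for a conditional UpdateItem call. The attribute names metadata and version, and the function name, are assumptions made for this example and are not part of the solution described in this post.

```python
def build_versioned_update(table_name, key, new_metadata, expected_version):
    """Build UpdateItem arguments that succeed only if the stored version matches.

    The attribute names 'metadata' and 'version' are hypothetical.
    """
    return {
        "TableName": table_name,
        "Key": key,
        # Write the new metadata and bump the version in the same request.
        "UpdateExpression": "SET #meta = :meta, #ver = :next",
        # The write is rejected if another writer changed the version first.
        "ConditionExpression": "#ver = :expected",
        "ExpressionAttributeNames": {"#meta": "metadata", "#ver": "version"},
        "ExpressionAttributeValues": {
            ":meta": {"S": new_metadata},
            ":expected": {"N": str(expected_version)},
            ":next": {"N": str(expected_version + 1)},
        },
    }
```

You could pass the result to a boto3 DynamoDB client, for example client.update_item(**arguments). If the stored version no longer matches, DynamoDB rejects the write with a ConditionalCheckFailedException, and the caller can re-read the item and decide how to proceed.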

Using incremental export to S3 is another interesting alternative. After you identify the time of the faulty application deployment, you export the DynamoDB data to Amazon Simple Storage Service (Amazon S3), but only the changes made from that time onward. With the data in Amazon S3, you can run a custom diagnostics script to identify the affected items and write their previous values to the live DynamoDB table. This method can be the quickest because it analyzes only a portion of your table data.
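As a minimal sketch of the incremental export request, the following Python function assembles arguments for the DynamoDB ExportTableToPointInTime API. The function name, the choice of DYNAMODB_JSON, and the NEW_AND_OLD_IMAGES view type are assumptions for illustration. Check the current API documentation for the allowed export window before relying on it.

```python
from datetime import datetime


def build_incremental_export(table_arn, bucket, export_from: datetime, export_to: datetime):
    """Build ExportTableToPointInTime arguments for an incremental export.

    export_from would be the time of the faulty deployment; export_to is the
    end of the window you want to inspect.
    """
    if export_from >= export_to:
        raise ValueError("export_from must be earlier than export_to")
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "ExportFormat": "DYNAMODB_JSON",
        "ExportType": "INCREMENTAL_EXPORT",
        "IncrementalExportSpecification": {
            "ExportFromTime": export_from,
            "ExportToTime": export_to,
            # Old images let a diagnostics script recover previous values.
            "ExportViewType": "NEW_AND_OLD_IMAGES",
        },
    }
```

With a boto3 DynamoDB client, the call would look like client.export_table_to_point_in_time(**arguments).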

The following are some examples of industries that can benefit from automated PITR solutions:

Ecommerce – Frequently updated product catalogs and promotional features require a reliable safety net to roll back changes without losing recent customer transactions. If a restore is required, you can return the entire system to the last known working state.
Content management systems – Rapid deployment cycles to keep up with content demand can sometimes introduce bugs that corrupt data, which an automated PITR solution can quickly resolve without losing new content. Learn more about how media and entertainment customers use DynamoDB for content management systems in this blog.
IoT data collection systems – Ongoing data collection is vital, but errors in data processing must be quickly fixed without interrupting the flow of new, correct data.

When they need to perform a PITR restore, engineers often face questions that help define the requirements and underlying challenges, such as: What happens to the data that is written to the table while it is being restored? Is there a way to update data that was changed during the restore process? Can we minimize downtime and keep the system operational while restoring it?

The following diagram shows the common challenges that you might face when dealing with data issues in a production DynamoDB environment.

The diagram depicts the following key events:

Initial state – The application is initially writing correct data to the DynamoDB table, and the system works as expected.
Issue introduction – The deployment of a new application version has resulted in unintended data corruption or other problems. The data’s integrity may be compromised by various factors, such as a software bug, schema change, or other changes.
Troubleshooting period – The team realizes the data issues and starts the troubleshooting process. During this time, the application continues to write more bad data to the table.
Restore decision – After careful analysis, the team decides that the best course of action is to restore the DynamoDB table to a known good state using the PITR feature.
PITR restore process – The team starts the PITR restore process to bring the table back to a specific point in time before the data issues occurred.

The PITR restore process is a critical step, but it also introduces a new challenge: what happens to the data that is being written to the table during the restoration process? The team needs to figure out a way to capture and incorporate any changes made during the PITR restore process in order to achieve a fluid transition back to the restored DynamoDB table. This is necessary to maintain data consistency and avoid data loss.

The following diagram illustrates the updated process using change data capture (CDC).

Addressing these challenges is crucial for maintaining the integrity and availability of your mission-critical applications that rely on DynamoDB. In the next section, we present a solution that automates the PITR restoration process and handles data changes during the restoration, helping you minimize downtime and maintain data consistency.

Prerequisites

Prepare your local environment and deploy the solution

This solution uses AWS CloudTrail management events to automate triggers around PITR restore events. To configure this automation, make sure CloudTrail management events are enabled in your target account, as shown in the following screenshot.

DynamoDB Table, PITR and DynamoDB Streams

Make sure you have PITR enabled on the DynamoDB table you want to restore. After enabling PITR, you can restore to any point in time between EarliestRestorableDateTime and LatestRestorableDateTime. LatestRestorableDateTime is typically 5 minutes before the current time.
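The following Python sketch shows one way to check whether a chosen restore time falls inside the restorable window. It works on the response of the DescribeContinuousBackups API. The function names are hypothetical and are not taken from the solution's source code.

```python
from datetime import datetime


def restorable_window(response):
    """Return (earliest, latest) restorable times from a DescribeContinuousBackups response."""
    description = response["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
    if description["PointInTimeRecoveryStatus"] != "ENABLED":
        raise ValueError("PITR is not enabled on this table")
    return description["EarliestRestorableDateTime"], description["LatestRestorableDateTime"]


def is_restorable(response, restore_time: datetime) -> bool:
    """Check that restore_time lies inside the restorable window (inclusive)."""
    earliest, latest = restorable_window(response)
    return earliest <= restore_time <= latest
```

With boto3, the response would come from client.describe_continuous_backups(TableName=...), which returns timezone-aware datetime values, so restore_time should be timezone-aware as well.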

DynamoDB Streams must be enabled as it will be used for CDC. After you enable DynamoDB Streams, copy the stream ARN as it will be used as a deployment parameter.
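If you prefer to look up the stream ARN programmatically, a sketch like the following reads it from a DescribeTable response. The function name is hypothetical.

```python
def stream_arn_from_description(response):
    """Return the latest stream ARN from a DescribeTable response.

    Raises ValueError when DynamoDB Streams is not enabled on the table.
    """
    table = response["Table"]
    specification = table.get("StreamSpecification", {})
    if not specification.get("StreamEnabled"):
        raise ValueError("DynamoDB Streams is not enabled on this table")
    return table["LatestStreamArn"]
```

With boto3, you would pass in the result of client.describe_table(TableName=...) and use the returned ARN as the table-streams-arn deployment parameter.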

AWS CDK

To deploy the solution, run the following snippet, which performs the necessary steps using an AWS Cloud Development Kit (AWS CDK) stack to set up, prepare, and deploy the components:

cdk bootstrap
cdk synth -c table-name=<insert table name here> -c table-streams-arn=<ddb streams arn here> 
cdk deploy -c table-name=<insert table name here> -c table-streams-arn=<ddb streams arn here> --qualifier final

Solution overview

In this post, we show how to automate many of these manual tasks and automatically replicate the current data to the newly restored table. The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The source table serves live traffic.
The system administrator decides to restore the table to a certain point in time.
The source table already has Amazon DynamoDB Streams enabled, and it captures all CRUD operations, enabling CDC functionality. DynamoDB Streams invokes an AWS Lambda function (CDC subscriber) to send the events to an Amazon Simple Queue Service (Amazon SQS) FIFO queue.
The SQS queue holds events until the new DynamoDB restored table is ready.
The DynamoDB action RestoreTableToPointInTime is recorded as a CloudTrail management event, which matches a configured Amazon EventBridge rule.
After the event is detected, EventBridge starts an AWS Step Functions workflow that monitors the status of the restored table. When the restore completes, the workflow invokes a Lambda function that enables the trigger between Amazon SQS and the CDC subscriber Lambda function.
The CDC subscriber Lambda function polls the SQS queue and writes all the replicated changes to the destination table.
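To make the stream-to-queue step in this workflow more concrete, here is a minimal Python sketch of forwarding DynamoDB Streams records to an SQS FIFO queue. It is not the solution's actual code. The message body format, the function names, and the single message group ID "cdc" are assumptions for illustration. The SQS client is passed in as a parameter; in a Lambda function it would typically be boto3.client("sqs").

```python
import json


def to_sqs_entries(records, group_id="cdc"):
    """Convert DynamoDB Streams records into SendMessageBatch entries."""
    entries = []
    for index, record in enumerate(records):
        body = {
            "eventName": record["eventName"],  # INSERT, MODIFY, or REMOVE
            "keys": record["dynamodb"]["Keys"],
            "newImage": record["dynamodb"].get("NewImage"),
        }
        entries.append({
            "Id": str(index),
            "MessageBody": json.dumps(body),
            # One group keeps strict ordering across all items (assumption).
            "MessageGroupId": group_id,
            # The stream event ID lets the FIFO queue drop duplicate sends.
            "MessageDeduplicationId": record["eventID"],
        })
    return entries


def forward_to_sqs(records, sqs_client, queue_url):
    """Send records in batches of 10, the SendMessageBatch limit."""
    entries = to_sqs_entries(records)
    for start in range(0, len(entries), 10):
        response = sqs_client.send_message_batch(
            QueueUrl=queue_url, Entries=entries[start:start + 10]
        )
        if response.get("Failed"):
            # Raise so the caller can retry the whole stream batch.
            raise RuntimeError(f"{len(response['Failed'])} messages failed")
```

Using a single message group preserves the order of all changes but limits throughput. Grouping by item key would be another option if ordering only matters per item.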

Validate the deployed resources

The AWS CDK commands create an EventBridge rule, a Step Functions state machine, an SQS queue, and several Lambda functions for backfilling the table during the restore action. The state machine will look like the following screenshot.

The AWS CDK stack provisions an SQS queue, which acts as a buffer for CDC.

This solution depends on CloudTrail events as a mechanism to automate triggers. It uses these events to capture a RestoreTableToPointInTime event and initiate the workflow. This is implemented using an EventBridge rule filter based on eventName and the source table name. See the following code:

{
  "source": ["aws.dynamodb"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["dynamodb.amazonaws.com"],
    "eventName": ["RestoreTableToPointInTime"]
  }
}
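The pattern shown here matches on eventName only. As a hedged sketch of how a source table filter might be added, the following Python function builds an extended pattern. It assumes the CloudTrail record exposes the table under detail.requestParameters.sourceTableName, which you should verify against a real event in your account before relying on it.

```python
import json


def restore_event_pattern(source_table_name):
    """Build an EventBridge pattern for PITR restores of one source table.

    The requestParameters.sourceTableName path is an assumption about the
    CloudTrail event shape, not something confirmed by this post.
    """
    return {
        "source": ["aws.dynamodb"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["dynamodb.amazonaws.com"],
            "eventName": ["RestoreTableToPointInTime"],
            "requestParameters": {"sourceTableName": [source_table_name]},
        },
    }


if __name__ == "__main__":
    # Print the pattern as JSON, ready to paste into a rule definition.
    print(json.dumps(restore_event_pattern("my-table"), indent=2))
```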

This rule targets the Step Functions state machine created by the stack. The state machine manages the status of the PITR process. A PITR restore process can take several minutes depending on the table size. While the restored table is in a transitional state, all write operations on the source table are stored in the SQS queue.

Lambda functions

The solution uses four Lambda functions that have distinct responsibilities. The source code is available in the GitHub repository.

cdc-to-sqs – This function replicates write operations on the source table into Amazon SQS. It's connected to the DynamoDB stream, and acts as a transparent passthrough into Amazon SQS.
check-ddb-status – This function issues DescribeTable calls to poll the status of the target table (restored table).
initiate-lambda-backfill – This is a control plane type of function. It’s invoked by the state machine when the PITR restore is complete.
lambda-backfill – This function polls the SQS queue until the queue is empty and all the changes have been replicated to the target table (restored table).
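As an illustration of the status check that check-ddb-status performs, the following Python sketch issues a single DescribeTable call and reports whether the restored table is ready. The function name and the shape of the returned dictionary are assumptions, not the repository's actual code. The client is passed in so the sketch stays independent of how it is created, for example with boto3.client("dynamodb").

```python
def check_table_status(dynamodb_client, table_name):
    """Return the current status of the restored table.

    A table being restored reports CREATING until it becomes ACTIVE.
    """
    response = dynamodb_client.describe_table(TableName=table_name)
    status = response["Table"]["TableStatus"]
    return {"tableName": table_name, "status": status, "ready": status == "ACTIVE"}
```

In a Step Functions workflow, a Choice state could inspect the ready flag and loop back through a Wait state until the table is active, which keeps the Lambda function itself short-lived.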

Run PITR restore

After you deploy the solution, you can run PITR restore. Configure the restore settings that fit your use case, as shown in the following screenshot.
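If you prefer to start the restore from code rather than the console, the following Python sketch builds the arguments for the RestoreTableToPointInTime API. The function name is hypothetical.

```python
from datetime import datetime
from typing import Optional


def build_restore_request(source_table, target_table, restore_time: Optional[datetime] = None):
    """Build RestoreTableToPointInTime arguments.

    Without restore_time, the request restores to the latest restorable time.
    """
    request = {"SourceTableName": source_table, "TargetTableName": target_table}
    if restore_time is None:
        request["UseLatestRestorableTime"] = True
    else:
        request["RestoreDateTime"] = restore_time
    return request
```

With boto3, the call would be client.restore_table_to_point_in_time(**request). Because the API call is recorded by CloudTrail in the same way as a console restore, it should trigger the same automation.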

This solution captures and replicates changes in near real-time during the restore process by sending data to an SQS queue. You can monitor these changes by inspecting the queue messages as they arrive.

After the PITR job completes, the solution's Step Functions state machine automatically registers a Lambda function as the consumer of the SQS queue. This Lambda function then begins backfilling the newly restored table with the queued operations.
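One way to turn on an SQS consumer is to enable an existing Lambda event source mapping. The sketch below assumes the mapping between the queue and the backfill function was created in a disabled state and that its UUID is known. Whether the solution creates or enables the mapping is not stated in this post, so treat this as an illustration only.

```python
def enable_backfill_trigger(lambda_client, mapping_uuid):
    """Enable an existing event source mapping and return its reported state."""
    response = lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,  # hypothetical: identifier of the SQS-to-Lambda mapping
        Enabled=True,
    )
    # The state is typically 'Enabling' first, then 'Enabled'.
    return response["State"]
```

The client would typically be boto3.client("lambda").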

When the SQS queue is empty, it means that all changes from the source table have been successfully replicated to the new table, ensuring data consistency between the original and restored tables.
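The backfill step can be pictured with the following Python sketch, which replays queued changes against the restored table. It assumes each message body is a JSON document with eventName, keys, and newImage fields in DynamoDB JSON. This is a hypothetical format chosen for the example, and it requires a stream view type that includes new images.

```python
import json


def apply_change(dynamodb_client, target_table, message_body):
    """Replay one captured change against the restored table."""
    change = json.loads(message_body)
    if change["eventName"] == "REMOVE":
        dynamodb_client.delete_item(TableName=target_table, Key=change["keys"])
    else:
        # INSERT and MODIFY both overwrite the item with its latest image.
        dynamodb_client.put_item(TableName=target_table, Item=change["newImage"])


def handler_for(dynamodb_client, target_table):
    """Create a Lambda-style handler for SQS events."""
    def handler(event, context=None):
        for record in event["Records"]:
            apply_change(dynamodb_client, target_table, record["body"])
        return {"processed": len(event["Records"])}
    return handler
```

Because the FIFO queue delivers messages in order within a message group, replaying them sequentially leaves each item in its latest captured state.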

Clean up

To remove the components created by this solution, run the following statement:

cdk destroy -c table-name=<insert table name here> -c table-streams-arn=<insert ddb streams ARN here>

Conclusion

Point-in-time recovery is a powerful tool to have in your toolbelt. With just a few Lambda functions and Step Functions for orchestration, you can build powerful deployments that offer flexibility in handling unexpected scenarios. By enabling PITR on your tables and combining it with a solution like the one described in this post, you can recover your data with near-zero application downtime.

The following are a few ideas to consider for future upgrades to include in this solution:

After the CDC catch-up process completes, if the solution uses AWS Lambda, automatically update the function’s environment variable to reference the restored table name.
When you enable PITR, you can restore to the LatestRestorableDateTime value, which is typically five minutes before the current time. To avoid the five-minute gap, you can run CDC for five minutes before triggering the solution described in this post.
Add an idempotency layer to ensure the writes from the CDC are only processed once.
Use incremental export to S3 to retrieve the changes made to the table since the restore time, so you can identify any bad writes and correct them with good values.
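As one possible shape for the idempotency idea in this list, the following Python sketch records each CDC event ID in a separate ledger table using a conditional write, and applies the change only when the ID has not been seen before. The ledger table, its event_id key, and the function names are all hypothetical.

```python
def build_ledger_put(ledger_table, event_id):
    """Build PutItem arguments that fail if the event ID is already recorded."""
    return {
        "TableName": ledger_table,
        "Item": {"event_id": {"S": event_id}},
        "ConditionExpression": "attribute_not_exists(event_id)",
    }


def process_once(dynamodb_client, ledger_table, event_id, apply):
    """Run apply() only the first time event_id is seen. Returns True if applied."""
    try:
        dynamodb_client.put_item(**build_ledger_put(ledger_table, event_id))
    except dynamodb_client.exceptions.ConditionalCheckFailedException:
        return False  # duplicate delivery, skip it
    apply()
    return True
```

This sketch records the event before applying it, so a failure inside apply() would leave the event marked as processed. A production design would need to handle that case, for example by writing the ledger entry and the change in one transaction.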

We’ve covered a lot of ground in this post about automating PITR for DynamoDB. Now it’s your turn to take action and try it out! The solution we’ve described is available in this GitHub repository. Give it a spin in your own environment and see how it can simplify your recovery process. Check out our documentation on PITR for DynamoDB to learn more about its capabilities.

This is just the first post in our series on table restores, migration and data integrity. Keep an eye out for upcoming posts where we’ll explore more advanced techniques and use cases.

Have you implemented PITR in your projects? Or maybe you’ve faced challenges with data recovery? We’d love to hear about it in the comments below.

About the Authors

Esteban Serna, Principal DynamoDB Specialist Solutions Architect, is a database enthusiast with 15 years of experience. From deploying contact center infrastructure to falling in love with NoSQL, Esteban’s journey led him to specialize in distributed computing. Today, he helps customers design massive-scale applications with single-digit millisecond latency using DynamoDB. Passionate about his work, Esteban loves nothing more than sharing his knowledge with others.

Leonardo Kanagusku is a Senior Startup Solutions Architect. He has a background in Software Engineering, building applications for the past 15 years. During these years, he gravitated towards one part of the stack typically overlooked: data models. Leonardo is passionate about the data stack in distributed systems, more specifically databases. Currently, he focuses on advising startups on how to grow their business using AWS, and his background has helped hundreds of startups scale and keep their database stack efficient.

AWS Database Blog