AWS Database Blog

How the Amazon TimeHub team designed a recovery and validation framework for their data replication framework: Part 4

In How the Amazon TimeHub team handled disruption in an AWS DMS CDC task caused by Oracle RESETLOGS: Part 3, we discussed how we built and tested a recovery framework for data replication using AWS Database Migration Service (AWS DMS) and AWS Glue during an Oracle RESETLOGS scenario. We covered the steps in the task recovery process and how we confirm there are no data discrepancies after recovery. With AWS DMS, you can use data validation to make sure your data was migrated accurately from the source to the target. If you enable validation for a task, AWS DMS begins comparing the source and target data immediately after a full load is performed for a table.

In this post, we describe the custom framework we built on top of AWS DMS validation tasks to maintain data integrity as part of the ongoing replication between source and target databases.

Solution overview

When replication fails, we need a standard operating procedure to recover from the failure and continue to maintain data integrity.

In the following sections, we cover what features are available within AWS DMS for data validations and their limitations. We then share how we developed a custom data validation framework to address the limitations, so that we can satisfy Amazon TimeHub’s business needs. For this use case, we use Amazon Aurora PostgreSQL-Compatible Edition for our target database.

AWS DMS data validation framework

AWS DMS provides data validation while the replication is ongoing. To achieve this, a validation task first reads all data into the replication instance and validates every record between the source and target. Then AWS DMS reads the transaction logs and validates the incoming changes. AWS DMS provides two options with its validation tasks:

  • Validation with ongoing replication – This option uses a replication task that also validates data as it comes from the source
  • Validation only tasks – In this mode, we create standalone validation tasks that validate data independent of the replication tasks

We wanted to keep validation independent of replication so that the validation task keeps running and reporting data discrepancies even if the replication task fails. This maintains isolation between the two functions. We had benchmarked our replication tasks and they were meeting our business SLAs, and because of the high data change rate in our environment, we wanted replication performance to remain stable with ongoing replication. We therefore decided to use standalone validation only tasks.
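The following sketch shows how a standalone validation only task could be created with the AWS SDK for Python (Boto3). This is a minimal illustration, not our exact implementation: the ARNs, identifiers, schema name, and setting values are placeholders, and you should confirm the validation settings and migration type against the AWS DMS data validation documentation for your use case.

```python
import json
import boto3

dms = boto3.client("dms")

# Illustrative validation settings for a standalone validation only task.
# With ValidationOnly enabled, the task validates data but doesn't replicate it,
# so it runs independently of the replication task.
validation_task_settings = {
    "ValidationSettings": {
        "EnableValidation": True,
        "ValidationOnly": True,
        "ThreadCount": 5,          # tune based on table sizes and instance capacity
        "FailureMaxCount": 10000,  # raise so validation isn't suspended too early
        "TableFailureMaxCount": 1000,
    }
}

dms.create_replication_task(
    ReplicationTaskIdentifier="validation-task-1",                                # placeholder
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",       # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",       # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",     # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-schema",
            "object-locator": {"schema-name": "APP_SCHEMA", "table-name": "%"},   # placeholder
            "rule-action": "include",
        }]
    }),
    ReplicationTaskSettings=json.dumps(validation_task_settings),
)
```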

Validation only task workflow

The following diagram illustrates the workflow for validation only tasks.

An AWS DMS CDC validation-only task first validates the full data set and caches the incoming changes during this time in the AWS DMS replication instance cache. After the initial full validation completes, it continues to validate the incoming changes against the data present in the target. To allow for replication lag in the target, there is a configurable threshold value in seconds for which AWS DMS waits before logging a record as a validation error. At the time of writing, the threshold value is set to 15 minutes (900 seconds). AWS DMS validates the data multiple times until the threshold wait time is reached before logging it as an error.

The following table contains sample data logged by AWS DMS for a validation error.

TASK_NAME | TABLE_OWNER | TABLE_NAME | FAILURE_TIME | KEY_TYPE | KEY | FAILURE_TYPE | DETAILS
validation_task_1 | <table_owner> | table1 | <failure_1_Timestamp> | Row | {"key": ["4905298532"]} | RECORD_DIFF | [[{'col_1': '3767801528'}, {'col_1': '<null>'}], [{'col_2': '2023-08-25 20:42:00'}, {'col_2': '2023-08-25 20:27:00'}], [{'col_3': '900'}, {'col_3': '0'}], [{'col_4': 'out'}, {'col_4': 'missedOut'}], [{'col_5': '0'}, {'col_5': '1'}], [{'col_6': '2023-08-26 03:42:53'}, {'col_6': '2023-08-26 03:27:08'}], [{'col_7': '1'}, {'col_7': '<null>'}]]
validation_task_2 | <table_owner> | table2 | <failure_2_Timestamp> | Row | {"key": ["4905285340"]} | MISSING_SOURCE |
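AWS DMS records these validation failures in the awsdms_validation_failures_v1 control table on the target endpoint, which is where the sample rows above come from. As a hypothetical illustration, the following Python snippet pulls recent failures from an Aurora PostgreSQL target; the connection details are placeholders, and the control table's schema and column casing should be confirmed in your target database.

```python
import psycopg2

# Illustrative monitoring query against the AWS DMS validation control table
# on the Aurora PostgreSQL target. Connection details are placeholders.
conn = psycopg2.connect(
    host="aurora-writer.cluster-example.us-east-1.rds.amazonaws.com",
    dbname="targetdb",
    user="dms_monitor",
    password="********",
)

sql = """
    SELECT task_name, table_owner, table_name, failure_time,
           key_type, key, failure_type, details
    FROM awsdms_validation_failures_v1
    WHERE failure_time >= NOW() - INTERVAL '1 day'
    ORDER BY failure_time
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    for row in cur.fetchall():
        print(row)

conn.close()
```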

Limitations with AWS DMS validation

AWS DMS data validation has the following limitations:

  • Failure limits – AWS DMS limits the maximum number of records that can fail validation before validation is suspended for the task (see Data validation task settings). In addition, AWS DMS stops replication when the threshold for apply errors is reached (see Error handling task settings).
  • False positives – Because AWS DMS allows for average latency and makes multiple attempts to validate the data in the target, we observed false positives (on average, 100 records per day) across the transactions from the 153 tables (100 million transactions per day) that we sync. We observed the following patterns for false positives:
    • When the source latency exceeds the threshold value due to issues in the data replication.
    • When a large number of changes occur very frequently.
  • Revalidation of logged errors – There is no way to revalidate records after they're logged as errors. In ongoing validation mode, validation tasks process records by transaction log sequence number (LSN). If a record doesn't arrive within the threshold delay limit and is logged as an error, it can't be revalidated because the log sequence of the task has already moved forward.
  • Validation task restart causes all data to be validated – When a validation only task is restarted, it validates all the data again before proceeding to validate the continuously replicating data. In our use case, it takes approximately 30–40 hours for validation tasks to catch up to current transactions, especially for the larger tables that have more than 1 billion records.

Data discrepancy scenarios that we encountered

Let's look at some data discrepancy scenarios that can't be handled by AWS DMS validation tasks. First, consider how an update anomaly can occur during data replication, as illustrated in the following table.

PK | Col1 | Col2 | DML_Type | Timestamp | Status | Reason
Rec1 | 1 | 0 | Insert | 3/12/24 9:20 | Done | Batch is committed every 15 (configurable) seconds. It was part of the batch.
Rec1 | 1 | 1 | Update | 3/12/24 9:21 | Not Done | It was not part of the batch.

Suppose the task is interrupted at 9:20:05, for reasons such as a deployment, merging or fixing an ongoing issue, or changing a task setting. After the AWS DMS task is restarted, in the first batch it merges the insert and the update into a single insert instead of applying them as two transactions. It tries to insert the following record.

PK | Col1 | Col2 | DML_Type | Timestamp
Rec1 | 1 | 1 | Update | 3/12/24 9:21

However, because the record already exists in the target, AWS DMS cannot insert the record, and the task fails with a primary key constraint violation error.

We can observe similar behavior in a delete anomaly, as illustrated in the following table.

PK | Col1 | Col2 | DML_Type | Timestamp | Status | Reason
Rec1 | 1 | 0 | Insert | 3/12/24 9:20 | Done | Batch is committed every 15 (configurable) seconds. It was part of the batch.
Rec1 | 1 | 0 | Delete | 3/12/24 9:21 | Not Done | Batch is committed every 15 (configurable) seconds. It was not part of the batch.

In the delete scenario, AWS DMS merges the insert and the delete and ends up applying nothing, so the source change (in this case, the deletion of the record) doesn't get applied to the target.

Steps to overcome the additional data discrepancy scenarios

In this section, we discuss the steps we took to overcome these limitations of AWS DMS data validation. Although most of the AWS DMS data replication from source to target can be monitored, in the case of abrupt failures (such as network failures or AWS Region infrastructure failures), AWS DMS can't handle the validations by itself. The following sections describe a standard operating procedure to recover from such failures and continue to maintain data integrity after a disaster recovery scenario.

Custom-built revalidation framework to eliminate false positives

Under normal circumstances, the AWS DMS validation framework doesn't log a substantial number of false positives. However, we observed false positives due to records falling into border conditions or latency, as explained earlier in this post. The daily average is 100 records (out of approximately 10 million changes). We built a framework that is scheduled to run every hour and revalidates the errors logged up to the previous hour. The following diagram illustrates an example workflow of the framework.

The revalidation job scheduled at 12:00 PM considers the errors logged until 11:00 AM by the AWS DMS validation tasks. The next run at 1:00 PM considers the errors logged until 12:00 PM, and so on. The job takes the source values for the error records and compares them with the data in the target at the time it runs. The 1-hour delay is introduced to accommodate and eliminate the impact of replication latency (up to 1 hour) on validation. For each record in the validation table, the revalidation framework determines whether it's a true positive or a false positive.
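The following is a simplified sketch of how such an hourly revalidation pass could work, assuming the logged errors have been copied into a tracking table (named dms_validation_errors here as a placeholder) with the extra columns shown in the table that follows. The primary key mapping, single-column keys, connection details, and the row comparison are all simplifications; the actual framework handles composite keys and compares rows column by column across Oracle and PostgreSQL data types.

```python
import json
from datetime import datetime, timedelta

import oracledb   # source: Oracle
import psycopg2   # target: Aurora PostgreSQL

# Hypothetical key-column mapping; a real framework would derive primary keys
# from metadata and also handle composite keys.
PRIMARY_KEYS = {"table1": "record_id", "table2": "record_id"}


def revalidate(src_conn, tgt_conn):
    """Revalidate AWS DMS validation errors logged up to the previous hour."""
    cutoff = datetime.utcnow() - timedelta(hours=1)
    src = src_conn.cursor()
    tgt = tgt_conn.cursor()

    # Errors copied from the DMS control table into our tracking table
    # (dms_validation_errors is a placeholder name).
    tgt.execute(
        """SELECT id, table_owner, table_name, key
           FROM dms_validation_errors
           WHERE failure_time < %s AND available_in_target IS NULL""",
        (cutoff,),
    )

    for err_id, owner, table, key_json in tgt.fetchall():
        key_col = PRIMARY_KEYS[table]
        key_val = json.loads(key_json)["key"][0]   # single-column key assumed

        # Current state of the record in the source and in the target.
        src.execute(f"SELECT * FROM {owner}.{table} WHERE {key_col} = :1", [key_val])
        source_row = src.fetchone()

        check = tgt_conn.cursor()
        check.execute(f"SELECT * FROM {table} WHERE {key_col} = %s", (key_val,))
        target_row = check.fetchone()

        available = target_row is not None
        # Simplified comparison; the real job compares column by column.
        matching = available and source_row == target_row

        check.execute(
            """UPDATE dms_validation_errors
               SET available_in_target = %s, matching_with_source = %s,
                   last_updated_datetime = NOW(), last_updated_by = %s
               WHERE id = %s""",
            (available, matching, "revalidation_job", err_id),
        )
        tgt_conn.commit()


if __name__ == "__main__":
    source = oracledb.connect(user="dms_val", password="***", dsn="source-db/ORCL")      # placeholder
    target = psycopg2.connect(host="aurora-writer.example", dbname="targetdb",           # placeholder
                              user="dms_val", password="***")
    revalidate(source, target)
```

In this sketch, a record found in the target and matching the source is marked as a false positive (available_in_target and matching_with_source both true), while a missing or mismatching record remains a true positive for manual review.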

The following table shows the sample data after revalidation.

id | task_name | table_owner | table_name | failure_time | key_type | key | failure_type | details | available_in_target | matching_with_source | last_updated_datetime | last_updated_by
12028475 | validation_task_1 | table_owner | table1 | <failure_1_Timestamp> | Row | {"key": ["4905298532"]} | RECORD_DIFF | [[{'col_1': '3767801528'}, {'col_1': '<null>'}], [{'col_2': '2023-08-25 20:42:00'}, {'col_2': '2023-08-25 20:27:00'}], [{'col_3': '900'}, {'col_3': '0'}], [{'col_4': 'out'}, {'col_4': 'missedOut'}], [{'col_5': '0'}, {'col_5': '1'}], [{'col_6': '2023-08-26 03:42:53'}, {'col_6': '2023-08-26 03:27:08'}], [{'col_7': '1'}, {'col_7': '<null>'}]] | FALSE | NULL | 00:00.0 | DMS_Master
12027282 | validation_task_2 | table_owner | table2 | <failure_2_Timestamp> | Row | {"key": ["4905285340"]} | MISSING_SOURCE | | TRUE | TRUE | 00:00.0 | DMS_Master

 

After the number of errors is brought down to a minimum through revalidation, the following diagram illustrates the correction process.

The revalidation framework aims to revalidate the errors logged by AWS DMS validation and minimize the errors logged as false positives. However, it can't fully eliminate false positives because of the following scenarios:

  • Because data is constantly changing in the source and target, both source and target data may have changed between the time the error was logged by AWS DMS and the time the revalidation framework validates it
  • The process can’t eliminate false positives if the replication delay is greater than 1 hour

When the number of errors is minimized, we compare the source and target records again manually and apply any correction needed.

Manual correction

As explained, data is always changing at the source and target, so no matter when we query them, there will always be an element of time variance. For example, if an auto-correction script determines that data in the target needs to be corrected and proceeds to apply the correction, the data might change in the source in that time delta (between detection and correction). In that event, we would end up contending with AWS DMS, which is replicating the same data, and could make the record incorrect.

In manual correction, we first find the root cause of a mismatched record and determine whether it's likely to change again (in which case, AWS DMS will auto-correct the record). If the source event is unlikely to occur again, we correct the record manually.

Use of table-level filters (on audit columns) to validate partial data

We explored the possibility of validating partial data when replication is disrupted and the tasks need to be restarted. This requires supplemental logging to be enabled on all the primary key columns of the table. If the table doesn't have a primary key, we need to add supplemental logging on the audit columns (for example, update_date_time). Restarting the validation task without filters causes it to validate all the records (over 6 billion) and takes multiple days to complete. With the primary key and audit column supplemental logging approach, we can run the validation only for the time period for which the data needs to be validated incrementally, and therefore retrieve results in just a few minutes or hours.
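As a hedged illustration of this approach, the following sketch applies a source filter on an audit column to a validation only task so that only rows changed since the disruption window are validated. The task ARN, schema, table, column name, and timestamp are placeholders, and the filter syntax should be checked against the AWS DMS table-mapping documentation; the task must be stopped before it can be modified.

```python
import json
import boto3

dms = boto3.client("dms")

# Hypothetical table mapping: validate only rows whose audit column
# (UPDATE_DATE_TIME) falls on or after the start of the disruption window.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "validate-recent-window",
            "object-locator": {"schema-name": "APP_SCHEMA", "table-name": "TABLE1"},  # placeholders
            "rule-action": "include",
            "filters": [
                {
                    "filter-type": "source",
                    "column-name": "UPDATE_DATE_TIME",
                    "filter-conditions": [
                        {"filter-operator": "gte", "value": "2024-03-12 09:00:00"}    # placeholder
                    ],
                }
            ],
        }
    ]
}

# Apply the filtered mapping to the (stopped) validation only task.
dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:111122223333:task:VALIDATION_TASK",     # placeholder
    TableMappings=json.dumps(table_mappings),
)
```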

Conclusion

In this post, we discussed how we built and tested a data validation framework for data replication using AWS DMS and Aurora PostgreSQL-Compatible. We also covered the data discrepancy scenarios that we observed in our testing and described methods to handle such situations in a controlled manner. These measures help us avoid data integrity issues due to unplanned failures at the source database, AWS DMS, or target database. By using this custom framework with AWS DMS validation tasks, operational teams can maintain data integrity as part of the ongoing replication between source and target databases.

The Amazon TimeHub team’s AWS DMS blog series provides a comprehensive overview of the journey of building a robust data replication framework using AWS Database Migration Service (AWS DMS). The series, spanning four parts, covers the entire process from initial implementation to ensuring data integrity and resilience. Part 1 introduced the low-latency replication solution from Oracle to Amazon Aurora PostgreSQL-Compatible Edition. Part 2 focused on designing resilience and high availability, addressing various failure scenarios and monitoring replication health. Part 3 delved into the specific challenge of Oracle RESETLOGS and how the team developed a recovery framework to handle such disruptions. Finally, Part 4 (this post) explored the custom framework built on top of AWS DMS validation tasks to maintain ongoing data integrity during replication. Throughout the series, we have shared insights into our approach to operational excellence, including monitoring, recovery processes, and data validation techniques, providing a holistic view of building and maintaining a reliable data replication system in a complex enterprise environment.

If you have any questions or comments about this post, share your thoughts in the comment section.


About the Authors

Ujjawal Kumar is a Data Engineer with Amazon's TimeHub team. He has over 12 years of experience as a data professional working in domains such as finance, treasury, sales, HCM, and timekeeping. He is passionate about learning all things data and implementing end-to-end data pipelines for data analytics and reporting. He is based out of Dallas, Texas.

Amit Aman is a Data Engineering Manager with Amazon's TimeHub team. He is passionate about building data solutions that help drive business outcomes. He has over 17 years of experience building data solutions for large enterprises across retail, airlines, and people experience. He is based out of Dallas, Texas.

Saikat Gomes is part of the Customer Solutions team in Amazon Web Services. He is passionate about helping enterprises succeed and realize benefits from cloud adoption. He is a strategic advisor to his customers for large-scale cloud transformations involving people, process, and technology. Prior to joining AWS, he held multiple consulting leadership positions and led large-scale transformation programs in the retail industry for over 20 years. He is based out of Los Angeles, California.