AWS Storage Blog

How Bridgewater maintains data consistency across Regions using Amazon S3 Replication

Bridgewater Associates is a global macro investment manager, with a core mission of understanding how the world’s markets and economies work by analyzing the drivers of markets and turning that understanding into high-quality portfolios and investment advice for their clients.

The data that drives this economic research is stored in Bridgewater’s data lake, built on top of Amazon S3. Data is a critical component that helps enable key business outcomes for Bridgewater and its clients, which places very high reliability and availability requirements on Bridgewater’s data.

In this post, we discuss how Bridgewater protects itself in the unlikely event of an issue impacting an AWS Region by replicating table data to a standby AWS Region while ensuring the consistency of the replicated data, thereby vastly improving its resiliency posture. Using Amazon S3 Replication, Bridgewater has a consistent view of its data in a standby AWS Region that is approximately two hours behind the primary AWS Region. This lag used to be approximately two weeks, an improvement of more than 99%.

How the data is structured

Bridgewater’s data lake has evolved over more than a decade. This architecture predates many recent developments in open table formats (OTFs), such as Apache Iceberg and Apache Hudi, which can simplify the time travel capabilities that Bridgewater implemented through this architecture. Time travel refers to the ability to query a table as it existed at previous points in time, allowing users to access historical data and analyze changes over time. This implementation also predates Amazon S3 Tables, which was announced during the writing of this post. The data structure introduced here provides a “git-for-data” time-traveling experience on top of Amazon S3.

The core design goals behind Bridgewater’s data lake are as follows:

  • Data should be structured to allow horizontal scaling of tables. As of this writing, read/write operations are performed on more than 10 PB of data consisting of over 100 billion objects in Amazon S3.
  • Data should be consistent.
  • It should be possible to go back in time. This specifically necessitates that data is never deleted.

At a high level, the layout is fairly simple and contains only three components:

  • An Amazon DynamoDB table holds pointers to the latest version of a given group of tables (such a version is called a “commit”, as in “git commit”). This table is referred to as the “commit table.”
  • A commit is a proto-serialized object that maps table names to their locations in Amazon S3. Commits themselves are stored in Amazon S3.
  • Table data is stored under a prefix (table ID) as several ORC files. Unlike open table formats such as Apache Iceberg, there isn’t a manifest listing all of the files that represent a table. Having a unique prefix (table ID) provides this manifest implicitly and enables Amazon S3 to scale out horizontally.

Table names (such as “PurchasingPowerParity”) can potentially contain information that reveals market holdings that Bridgewater considers sensitive and thus are never part of the Amazon S3 prefix directly. Instead, a globally unique identifier (guid) is put into the prefix, and mapping between the table name and the guid is stored inside the encrypted commit object.

The following diagram demonstrates this:

Figure 1: Structure of Bridgewater’s commits and tables
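
To make the layout concrete, the following minimal sketch shows what the commit table entry and the commit object could look like. The field names (latest_commit_id, parent_commit_id, the table name-to-guid mapping) are illustrative assumptions based on the description above, not Bridgewater’s actual schema, and plain Python dataclasses stand in for the proto-serialized format.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Illustrative types only; Bridgewater's real schema and proto definitions differ.

@dataclass
class CommitPointer:
    """Item in the DynamoDB commit table: points a group of tables to its latest commit."""
    table_group: str       # partition key, e.g. "research-tables" (hypothetical)
    latest_commit_id: str  # guid of the latest commit object in Amazon S3

@dataclass
class Commit:
    """Commit object stored in Amazon S3 (proto-serialized in reality, a dataclass here)."""
    commit_id: str
    parent_commit_id: Optional[str]  # enables time travel through the commit chain
    tables: Dict[str, str]           # table name -> table ID (the guid prefix in Amazon S3)

# Table data then lives under s3://<data-bucket>/<table-id>/part-0000.orc, part-0001.orc, ...
# The sensitive table name ("PurchasingPowerParity") never appears in the S3 key; only the guid does.
```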

This simple structure is powerful in several different ways.

  • Having a commit object that represents a group of tables allows Bridgewater to easily get a “consistent view” of the world as of a specific point in time (for example, as of June 1, 2019).
  • Data is never deleted, with S3 Object Lock in use, and hence considered immutable. In this context “immutable” means that the data cannot be updated, and new table versions are stored with a different guid.
  • Having ParentCommitId allows time travel (the ability to query a table as it existed at previous points in time), which enables understanding how the state of a group of tables changed over time (a sketch of this follows the list).
  • Data stored in Amazon S3 allows this architecture to scale to millions of tables that are read concurrently.
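
As an illustration of how time travel could be implemented over this layout, the following sketch walks the parent-commit chain until it finds the commit that was current at the requested point in time. It reuses the hypothetical Commit type from the earlier sketch and assumes an additional created_at timestamp field and a load_commit helper that fetches and deserializes a commit object from Amazon S3; none of these names come from Bridgewater’s actual code.

```python
from datetime import datetime, timezone

def commit_as_of(latest_commit_id: str, point_in_time: datetime, load_commit) -> "Commit":
    """Walk the parent-commit chain backwards until a commit created at or before
    `point_in_time` is found. `load_commit(commit_id)` is an assumed helper."""
    commit = load_commit(latest_commit_id)
    while commit.created_at > point_in_time and commit.parent_commit_id is not None:
        commit = load_commit(commit.parent_commit_id)
    return commit

# Example: read table locations as they were on June 1, 2019.
# commit = commit_as_of(latest_id, datetime(2019, 6, 1, tzinfo=timezone.utc), load_commit)
# prefix = commit.tables["PurchasingPowerParity"]  # guid prefix for that table's ORC files
```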

Bridgewater data replication primitives

Bridgewater writes data to a bucket in a single Region and uses S3 Replication to replicate the data to a bucket in a second AWS Region for high availability. The bucket in the second Region is mostly read-only.

DynamoDB

DynamoDB has a feature called global tables, which enables a table to be distributed across multiple AWS Regions. Writes in one AWS Region are replicated to a different AWS Region asynchronously, and in the case of a race condition “the last update wins.” Bridgewater’s commit table in DynamoDB is set up as a global table to replicate between Region A and Region B.

| Time | State in Region A | State in Region B |
|---|---|---|
| t0 | (key1, value1) | (key1, value1) |
| t1 – write happens in Region A | (key1, value2) | (key1, value1) |
| t2 – replication to Region B | (key1, value2) | (key1, value2) |
| t3 – write happens in Region B | (key1, value2) | (key1, value3) |
| t4 – write happens in Region A | (key1, value4) | (key1, value3) |
| t5 – replication to Region A; because the timestamp of the row is greater than the timestamp in the replication event, the value does NOT change | (key1, value4) – the (key1, value3) update from replication is discarded | (key1, value3) |
| t6 – replication to Region B | (key1, value4) | (key1, value4) |

As you can see in the preceding table, value3 from Region B was “lost” and instead value4 written in Region A persists in both AWS Regions.

Finally, with DynamoDB Global Tables, replication events are replayed in the same order as they happened. This means that in the preceding example it’s impossible for Region B to experience value4 before value2.

Bridgewater also uses DynamoDB Point-in-time recovery, which allows creating a copy of a DynamoDB table at a specific point in the past, for example as of two hours ago.
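
As a rough boto3 sketch of both capabilities, the following adds a replica Region to an existing table (making it a global table) and restores a copy of the table as it existed two hours ago using point-in-time recovery. Table names and Regions are placeholders, not Bridgewater’s actual configuration.

```python
import boto3
from datetime import datetime, timedelta, timezone

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # Region A (placeholder)

# Add a replica in Region B, turning the commit table into a global table.
dynamodb.update_table(
    TableName="commit-table",  # placeholder name
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],  # Region B (placeholder)
)

# Restore a copy of the table as it existed two hours ago using point-in-time recovery.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="commit-table",
    TargetTableName="commit-table-as-of-replication-point",
    RestoreDateTime=datetime.now(timezone.utc) - timedelta(hours=2),
)
```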

Amazon S3 Cross-Region Replication (CRR)

You can configure Amazon S3 CRR between two buckets in either a uni-directional or bi-directional manner. In this post we explore uni-directional replication only.

For replication to be enabled, S3 Versioning must be enabled on both the source and destination buckets. Even if an object was written under the same key in both AWS Regions and replication happened, the original object won’t be lost and would still be accessible using its VersionId.
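
A minimal boto3 sketch of such a uni-directional setup could look like the following. The bucket names, IAM role ARN, and rule ID are placeholders, and Bridgewater’s actual configuration is more involved.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # source Region (placeholder)

# S3 Versioning must be enabled on both the source and destination buckets.
s3.put_bucket_versioning(
    Bucket="source-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Uni-directional replication rule from the source bucket to the destination bucket.
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter applies the rule to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                # S3 RTC (ReplicationTime and Metrics) can be added to Destination; see the RTC section.
                "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
            }
        ],
    },
)
```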

Unlike DynamoDB, in the case of S3 Replication, objects may not be replicated in the strict order in which they were added to the bucket. For example, if objects were added to the original bucket in the order A, B, C, D, then the replication mechanism can end up creating object D in the target bucket first, followed by objects B, C, and A. This is especially important for Bridgewater’s object layout, as it is possible that the commit object is replicated before the tables it references. It is also possible that, out of 10,000 data files representing a table, a small subset lags in replication; because replication is eventually consistent, reading from such a table during this period may return only a subset of the data.

Replication failures aren’t prevalent, but they can happen. Transient replication failures are automatically retried by Amazon S3. If a failure persists, for example due to a configuration issue, then, to achieve high consistency where 100% of objects are replicated, Bridgewater built a system to keep track of objects that failed to replicate. This system monitors replication failure events, saving the object key to persistent storage and attempting replication later with exponential backoff.

Specific Amazon CloudWatch metrics help monitor the replication process (a query sketch follows this list).

  • ReplicationLatency – the maximum number of seconds by which the replication destination bucket is behind the source bucket for a given replication rule.
  • BytesPendingReplication – the total number of bytes of objects pending replication for a given replication rule.
  • OperationsPendingReplication – the number of operations pending replication for a given replication rule.
  • OperationsFailedReplication – the number of operations that failed replication for a given replication rule.
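
For example, the maximum ReplicationLatency over the last hour could be read from CloudWatch in the source Region roughly as follows; the bucket names and rule ID are placeholders and must match the replication configuration.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # source Region (placeholder)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="ReplicationLatency",
    Dimensions=[
        {"Name": "SourceBucket", "Value": "source-bucket"},
        {"Name": "DestinationBucket", "Value": "destination-bucket"},
        {"Name": "RuleId", "Value": "replicate-everything"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Maximum"],
)
# Seconds by which the destination bucket is behind the source bucket.
latency_seconds = max((p["Maximum"] for p in response["Datapoints"]), default=0)
```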

S3 Replication Time Control (S3 RTC)

S3 Replication Time Control (S3 RTC) provides two important features:

  • A predictable replication time backed by a service level agreement: S3 RTC is designed to replicate 99.99% of objects within 15 minutes of upload.
  • Visibility into replication: S3 RTC enables S3 Replication metrics and additional replication event notifications.

The following diagram shows how ReplicationLatency changes over time as objects are added to the bucket, replicate, or fail to replicate, and the events that get emitted.

Figure 2: How replication latency changes over time

Furthermore, beyond the S3 Replication metrics mentioned earlier, with S3 RTC enabled you get three more event notifications (a subscription sketch follows this list):

  • s3:Replication:OperationMissedThreshold – this event notification is received when an object that was eligible for replication using Amazon S3 RTC exceeds the 15 minute threshold for replication.
  • s3:Replication:OperationReplicatedAfterThreshold – this event notification is received for an object that was eligible for replication using the Amazon S3 RTC feature and replicates after the 15 minute threshold.
  • s3:Replication:OperationNotTracked – this event notification is received for an object that was eligible for replication using Amazon S3 RTC but is no longer tracked by replication metrics.
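
These notifications are delivered through the standard S3 event notification mechanisms. As a hedged sketch, the failure and threshold events could be routed to an SQS queue as follows; the bucket name and queue ARN are placeholders, and Bridgewater’s actual event routing may differ.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # source Region (placeholder)

s3.put_bucket_notification_configuration(
    Bucket="source-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "Id": "replication-events",
                "QueueArn": "arn:aws:sqs:us-east-1:111122223333:replication-events",  # placeholder
                "Events": [
                    "s3:Replication:OperationFailedReplication",
                    "s3:Replication:OperationMissedThreshold",
                    "s3:Replication:OperationReplicatedAfterThreshold",
                    "s3:Replication:OperationNotTracked",
                ],
            }
        ]
    },
)
```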

Bridgewater replication architecture

To provide a consistent read replica in another AWS Region, Bridgewater effectively had to solve a single problem: how to make sure that all referenceable data is available in the second Region. This specifically means that if DynamoDB points to a commit object, then that commit, all the data files from all the tables in that commit, and all the data files from all the tables in its parent commits (and so on) should have already been replicated.

One simple solution to this problem is graph traversal. However, given that Bridgewater can have commit chains that are tens of thousands of commits long (each commit can reference dozens of tables, each having thousands of files), this becomes too slow very quickly.

Instead, Bridgewater used the technologies described previously in the following way:

  • S3 Replication is configured with S3 RTC enabled.
  • As objects are replicated, Bridgewater tracks objects that failed to replicate in a persistent store (DynamoDB).
  • Using a combination of failed replications and ReplicationLatency metric, Bridgewater calculates ReplicationPoint (point in time where all objects written to Region A up to this point have been successfully replicated in Region B).
  • Bridgewater restores a copy of the DynamoDB table as of ReplicationPoint using point-in-time restore. This copy is now the consistent replica.

Doing all of this makes sure that any commit that exists in DynamoDB is actually present in Amazon S3, and that its parents and data files are also present (because they were written prior to it). Files are immutable, as they can be neither deleted nor updated; thus, object versions don’t need to be tracked and Amazon S3 keys are enough.

Detailed architecture

Think about the replication process as three separate processes that run in parallel:

  • Replicate objects and track failures.
  • ReplicationPoint advancement moves the replication point forward in time as objects are replicated from Region A to Region B.
  • State reconciliation updates DynamoDB in a way that it only points to commits in Amazon S3 that reference fully replicated data.

The first process is about understanding what objects haven’t been replicated. This information is stored in a StateTable (DynamoDB) and effectively has the Amazon S3 key for objects that failed to replicate, as well as the timestamp for when a given object was created, as shown in Figure 3.

Figure 3: Retry objects that failed replication and advance replication point

There are two main failure modes accounted for here:

  • If replication is slower than normal, or transient failures happen in S3 Replication that the service retries, then Bridgewater sees OperationReplicatedAfterThreshold events. When things are back to normal and objects are replicated, there aren’t any special events (and although Bridgewater could always rely on the ObjectCreated:Put event, there are simply too many of them to do a DynamoDB lookup for each one). This is why the State Table Processor, which is periodically triggered using Amazon EventBridge, checks whether objects in the State Table have been replicated to the secondary bucket and removes them from the tracking table.
  • When a configuration issue happens within Bridgewater, they have to step in and fix the underlying issue. However, after the issue is fixed, Amazon S3 isn’t going to automatically retry replication. This is why the Replication Restart Processor (another AWS Lambda function triggered by EventBridge) checks objects in the State Table and, if any failed objects are discovered, forces replication to happen. Currently the only way to do this is to “copy the object onto itself”, creating exactly the same content but a different Amazon S3 versionId, which is enough to trigger another attempt at replication (a minimal sketch of this follows the list).
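
Here is a minimal sketch of that forced retry, assuming the key of a failed object has already been read back from the State Table. Copying an object onto itself requires replacing something (here, the metadata directive), which produces a new version ID and makes the object eligible for replication again; the bucket and key are placeholders.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # source Region (placeholder)

def force_replication_retry(bucket: str, key: str) -> str:
    """Copy an object onto itself to create a new version ID with identical content,
    which makes S3 Replication evaluate the object again."""
    head = s3.head_object(Bucket=bucket, Key=key)
    response = s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=head.get("Metadata", {}),  # keep the existing user metadata
        MetadataDirective="REPLACE",        # required when copying an object onto itself
    )
    return response["VersionId"]

# Example: force_replication_retry("source-bucket", "<table-id>/part-0042.orc")
```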

The second process takes information from the StateTable and ReplicationLatency metric from Region A (primary Region) and publishes a new CloudWatch metric called ReplicationPoint in Region B (standby Region), which you can use to get consistent state, as shown in the following figure.

Figure 4: Publishing ReplicationPoint metric to standby Region
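
As a simplified sketch of this calculation and publication, assuming each StateTable entry carries the creation time of the object that failed to replicate and that the latency value has already been read from the ReplicationLatency metric (the namespace and names below are illustrative, not Bridgewater’s), the process could look like the following.

```python
import boto3
from datetime import datetime, timedelta, timezone

def compute_replication_point(state_table_items, replication_latency_seconds: float) -> datetime:
    """Return the point in time up to which everything written to Region A has
    been replicated to Region B: the older of (now - ReplicationLatency) and the
    creation time of the oldest object still awaiting replication."""
    candidate = datetime.now(timezone.utc) - timedelta(seconds=replication_latency_seconds)
    oldest_failed = min((item["created_at"] for item in state_table_items), default=None)
    if oldest_failed is not None:
        candidate = min(candidate, oldest_failed)
    return candidate

def publish_replication_point(point: datetime) -> None:
    """Publish ReplicationPoint as a custom CloudWatch metric in Region B (standby)."""
    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")  # standby Region (placeholder)
    cloudwatch.put_metric_data(
        Namespace="Replication",  # illustrative namespace
        MetricData=[{
            "MetricName": "ReplicationPoint",
            "Value": point.timestamp(),  # epoch seconds
            "Unit": "Seconds",
        }],
    )
```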

Finally, the last step takes the ReplicationPoint metric as input and uses that to get a copy of DynamoDB that would point only to commits referencing the fully replicated data, as shown in the following figure.

Figure 5: Using the ReplicationPoint metric in the standby Region to update DynamoDB table to point to fully replicated commits

It’s worth expanding on what “intelligent copy from Region A to Region B” means in the last diagram. Previously, it was mentioned that the replica in Region B is mostly read-only, although it allows writes. Bridgewater hasn’t (yet) invested in making the Region B restore process fully automated, which means that data written in Region B is not automatically available in Region A. Other than that, however, the secondary Region is fully functional and supports both reads and writes. This specifically means that data can be written in Region B, and such writes must still satisfy all consistency requirements:

  • Commits or data written in Region B shouldn’t be automatically replicated back to Region A, thus Bridgewater can’t just write directly to the DynamoDB commit table, as it is a part of GlobalTables configuration.
  • The point-in-time DynamoDB snapshot only has data from Region A, thus Bridgewater can’t use that either. It doesn’t have any capability to preserve changes that happened in Region B.

Therefore, Bridgewater has one more DynamoDB table in Region B that is written to directly, in the same way that the commit table is used in the primary Region. It is also constantly updated by the replication process. This means that in case of an emergency there is a fully functional system in the secondary Region, which can be reconciled manually if it has to be used for an extended period of time.

Finally, there are a few extra consistency-checks on top of S3 Replication:

  • Using Amazon S3 Inventory reports from both the primary and secondary buckets, Bridgewater runs a daily comparison in Amazon Athena to make sure that all objects written to Region A during the previous day have been replicated to Region B (a sketch of such a comparison follows this list).
  • During the “intelligent copy” of DynamoDB entries, they also make sure that all the tables referenced by the current commits (not including parents) are actually present in the Region B bucket. This is a good trade-off between having a safety net if something breaks and having to deal with a huge dependency graph of commit chains.
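
A hedged sketch of such a comparison, assuming the two inventory reports are already exposed as Athena tables (the table names, database, and output location below are placeholders), could look like the following.

```python
import boto3

# Objects present in the source inventory but missing from the destination inventory.
# The real query would also restrict the comparison to objects written during the previous day.
QUERY = """
SELECT src.key
FROM source_inventory src
LEFT JOIN destination_inventory dst ON src.key = dst.key
WHERE dst.key IS NULL
"""

athena = boto3.client("athena", region_name="us-west-2")  # standby Region (placeholder)

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "s3_inventory"},                     # placeholder
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},  # placeholder
)
# Poll the execution ID and alert on any missing keys.
print(execution["QueryExecutionId"])
```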

The replication process is orchestrated by a service running on Amazon Elastic Kubernetes Service (Amazon EKS). All the steps described previously (creating a point-in-time replica, the “intelligent copy”, consistency checks, and so on) are executed back-to-back and take approximately two hours in total. After the current iteration is done, the next one starts immediately, which means Bridgewater has a consistent view in Region B approximately two hours behind Region A. Bridgewater is a macro investment manager and operates on wider time frames (days and hours rather than minutes and seconds), so this delay was sufficiently short to support the business, and there wasn’t any need to push it further. If a shorter delay were needed, then several changes could support it:

  • Replace point-in-time restore with consuming directly from DynamoDB Streams, reducing the delay by approximately 45 minutes.
  • Similarly, replace the bulky “intelligent copy” followed by consistency validation with a Lambda function that processes individual DynamoDB entries in isolation, reducing the delay by approximately one more hour.

With these two changes, Region B should be no more than 15 minutes behind Region A most of the time.

Conclusion

In this post, we discussed how Bridgewater has set up a replication architecture for their table data that is stored in Amazon S3. They have accomplished this by building upon the native capabilities provided by AWS services such as Amazon DynamoDB and Amazon S3 to create a system that takes into account completeness and consistency of data replicated between their primary and standby AWS Regions. This enables multi-Region resilience for tabular data and enables Bridgewater workloads to reliably use data from their standby AWS Region, should a large scale Regional or service event happen in the primary Region.

Incorporating this replication architecture also vastly improved their recovery point objective: the standby Region now provides a complete and consistent view of their data lake only two hours behind the primary Region, down from approximately two weeks, an improvement of more than 99%.

To learn more about how Amazon S3 enables customers to protect data, refer to the user guide: Data Protection in Amazon S3.

Acknowledgements

Minoo Portocarrero, Manager of Engineering Research Technology, Bridgewater Associates

Sayan Chakraborty

Sayan Chakraborty is a Sr. Solutions Architect at AWS. He helps large enterprises build secure, scalable, and performant solutions in the AWS Cloud. With a background in enterprise and technology architecture, he has experience delivering large-scale digital transformation programs across a wide range of industry verticals. He holds a B. Tech. degree in Computer Engineering from Manipal University, Sikkim, India.

Aleksei Poliakov

Aleksei Poliakov is a Lead Software Engineer at EPAM Systems and has been working with Bridgewater since 2020. Aleksei has worked on a wide variety of projects, including the distributed execution platform, storage renovation, setting up cost & governance controls, and enabling efficient data exploration. He is passionate about pushing the limits of what’s possible with technology and finding creative, performant, and cost-efficient solutions to business problems.

Eddie Sohn

Eddie Sohn is an Architect at Bridgewater. He has been with the company for over 20 years and has worked on many core initiatives during his tenure. Eddie is a true techie through and through—whether it’s a new hot database or the revival of a retro gaming console, he is certainly aware (and likely at the top of the waiting list if there is one!).

Peter Sideris

Peter Sideris is a Principal Technical Account Manager at AWS. He works with some of our largest and most complex financial services customers to ensure their success in the AWS Cloud. Peter enjoys spending time with his family and marine reef keeping, and he volunteers his time with the Boy Scouts of America in several capacities.