AWS Cloud Operations Blog

How Stripe architected a massive-scale observability solution on AWS

This post is co-written with Cody Rioux, Staff Engineer at Stripe, and Michael Cowgill, Staff Engineer at Stripe.

Stripe powers online and in-person payment processing and provides financial solutions for businesses of all sizes. Stripe operates a sophisticated microservice environment built on top of AWS. In this blog post, we cover the journey, the challenges Stripe faced, and the solutions they implemented when migrating their observability solution and users onto Amazon Managed Service for Prometheus (AMP).

Prior to embarking on this migration journey, Stripe’s observability footprint included approximately 300M metrics, 40k alerts, and 100k dashboard queries produced by ~7k employees. Stripe was using a time-series data platform as their metrics monitoring solution, and recognized the need to re-architect when they encountered scalability limits, reliability issues, and increasing costs with that solution.

Challenges

Stripe’s primary challenges with their existing third-party vendor solution were cost and scalability. As they transitioned to microservices, their production environment became more complex. This change highlighted the need for both higher capacity and better cost efficiency, which Stripe achieved by migrating the metrics storage layer to Amazon Managed Service for Prometheus.

Overview of Solution

Stripe formulated a four-phase approach to this migration:

  • Dual-write metrics to the legacy time-series database (TSDB) and AMP
  • Translate assets (alerts and dashboards) from the source language into PromQL and Amazon Managed Grafana Dashboards
  • Validate the results of the translated assets and the dual written metrics in an iterative fashion
  • Migrate users’ mental models to a Prometheus-compatible mental model

Stripe Observability Solution Architecture

Figure 1: Architecture during migration

  1. Collection: Metrics are collected from host applications and scraped from Kubernetes clusters, then sent to the aggregation layer.
  2. Aggregation: The shuffler delivers data to the correct host while the aggregator averages gauges, sums counters, and computes percentiles for cardinality reduction.
  3. Writing to AMP: The aggregation layer sends the metrics to an egress proxy that writes to AMP. Metrics from CloudWatch are also written to AMP through the egress proxy to provide a unified storage layer.
  4. Dual-Write: The aggregated metrics, as well as unaggregated metrics, are also sent to the existing egress proxy, which writes them to the legacy time-series database (TSDB) during the migration.

Phase 1: Dual-Write

The first step is to begin writing metrics to both the legacy TSDB solution and AMP. This is typically straightforward when running one of the industry-standard collectors such as the OpenTelemetry Collector, Vector, or Veneur: all that is needed is a second sink (Vector, Veneur) or exporter (OpenTelemetry) configuration that writes metrics to the new AMP remote write endpoint. Assuming the collector reports no errors, the workspace is ready to be queried for the respective data.
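For example, with an OpenTelemetry Collector the change amounts to registering one more exporter and adding it to the metrics pipeline alongside the existing one. The following is a minimal sketch (not Stripe's actual setup) of a Python helper that patches an existing collector configuration file; the workspace endpoint, region, and file path are placeholders:

```python
# Sketch: add an AMP remote-write exporter to an existing OpenTelemetry
# Collector config so metrics are written to both the legacy TSDB and AMP.
# The workspace endpoint, region, and file path below are placeholders.
import yaml

COLLECTOR_CONFIG = "otel-collector-config.yaml"  # hypothetical path
AMP_ENDPOINT = (
    "https://aps-workspaces.us-west-2.amazonaws.com"
    "/workspaces/ws-EXAMPLE/api/v1/remote_write"  # placeholder workspace
)

with open(COLLECTOR_CONFIG) as f:
    config = yaml.safe_load(f)

# Register the additional exporter alongside whatever already exists.
config.setdefault("exporters", {})["prometheusremotewrite/amp"] = {
    "endpoint": AMP_ENDPOINT,
    # AMP requires SigV4-signed requests; the sigv4auth extension from the
    # collector-contrib distribution is one way to provide that.
    "auth": {"authenticator": "sigv4auth"},
}
config.setdefault("extensions", {})["sigv4auth"] = {"region": "us-west-2"}

service = config.setdefault("service", {})
service.setdefault("extensions", []).append("sigv4auth")
metrics_pipeline = service.setdefault("pipelines", {}).setdefault("metrics", {})
metrics_pipeline.setdefault("exporters", []).append("prometheusremotewrite/amp")

with open(COLLECTOR_CONFIG, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```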

Once Stripe started ingesting metrics, they needed to ensure the validity of these metrics. Concentrating statistical validation techniques at that stage would have been suboptimal. Simply put, even a modest false positive rate would have had them spending their entire time and budget chasing down false positive metric failures. For example, with 250M metric time series (MTS) and a 0.01% false positive rate, there would have been 25,000 false positives to look into.

Stripe adopted a methodical approach, mapping out each path a metric can take from generation to storage and creating test cases to ensure comprehensive coverage of all possible paths. After identifying and resolving a few bugs, they gained reasonable confidence that all paths were functioning as expected. Further validation was deferred until the alerts problem was addressed, as alerts provide a clear signal of what is working correctly and what requires attention, promptly directing human resources to areas of immediate need.

Key Takeaway

Move through the dual-write phase quickly, validating individual code paths but not lingering. Systemic issues, if they still exist, will be discovered during Phase 3.

Phase 2: Translation

Once the data was written to AMP, it was time to translate the assets (alerts and dashboards) from the source language into PromQL/Alertmanager/Grafana formats. This process presented two trade-off dimensions. The first was a choice between manual and automated translation: manual translation offered the benefit of doubling as a training exercise, although it was not recommended beyond a few hundred alerts and a few dozen dashboards. The second required a decision between compiler-based translation and artificial intelligence-based translation.

Stripe opted to create a compiler that ingested their old solution’s query language, performed small optimizations on the parse tree, and then compiled the tree to PromQL. The implementation unfolded in several distinct phases. Initially, they realized that a significant return on investment could be achieved simply by throwing an exception whenever the compiler encountered something unknown. Eventually, they leveraged validation (discussed further below) to inform where adjustments were required. They also recognized that the entire input domain did not need to be compiled, as their actual alert surface was likely a small subset of the possible input space.
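As an illustration of this compile-and-raise pattern (not Stripe's actual compiler; the node types and source language here are hypothetical), a translator can walk the parse tree and refuse anything it does not yet understand:

```python
# Minimal sketch of a parse-tree-to-PromQL compiler that fails loudly on
# anything it does not yet support. Node shapes here are hypothetical.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    filters: dict  # e.g. {"service": "api"}

@dataclass
class Rate:
    child: Metric
    window: str    # e.g. "5m"

@dataclass
class Compare:
    op: str        # ">", "<", ...
    left: object
    right: float

class UnsupportedNode(Exception):
    """Raised whenever the compiler meets a construct it cannot translate yet."""

def compile_node(node) -> str:
    if isinstance(node, Metric):
        labels = ",".join(f'{k}="{v}"' for k, v in sorted(node.filters.items()))
        return f"{node.name}{{{labels}}}" if labels else node.name
    if isinstance(node, Rate):
        return f"rate({compile_node(node.child)}[{node.window}])"
    if isinstance(node, Compare):
        return f"{compile_node(node.left)} {node.op} {node.right}"
    # Throwing here is the high-ROI move: it surfaces exactly which source
    # constructs still need translation work.
    raise UnsupportedNode(type(node).__name__)

# Example: a source-language alert condition compiled to PromQL.
expr = Compare(">", Rate(Metric("http_errors_total", {"service": "api"}), "5m"), 0.5)
print(compile_node(expr))  # rate(http_errors_total{service="api"}[5m]) > 0.5
```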

The compiler approach had several benefits:

  1. It gave an accurate picture of query compilation coverage at any given time.
  2. A fix to compilation for one query would improve all queries.
  3. The parse trees enabled them to identify clusters of semantically identical alerts. This proved incredibly useful during the validation phase.

Optimization

Stripe discovered that despite the very large number of alerts that could be written, their users had adopted a small set of alert patterns and applied them across the board. To exploit this, they provided “modules”, templated alerts that allowed users to specify what they wanted to alert on (ratios, sustained values, and basic thresholds) rather than how to alert. During the migration, Stripe was able to increase enrollment in these modules to 60% of all alerts. This move toward a declarative approach to specifying alerts significantly reduced the surface area of alerts that needed to be translated.
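To make the module idea concrete, here is a hypothetical sketch (not Stripe's module format) of templates that expand a small "what to alert on" spec into full Prometheus alerting rules:

```python
# Sketch of a "module": users declare what to alert on, the template decides how.
# The spec fields and generated rule shape are hypothetical.
import yaml

def threshold_module(name: str, expr: str, op: str, value: float,
                     sustained_for: str = "5m", severity: str = "page") -> dict:
    """Expand a basic-threshold spec into a Prometheus alerting rule."""
    return {
        "alert": name,
        "expr": f"{expr} {op} {value}",
        "for": sustained_for,              # "sustained value" behavior
        "labels": {"severity": severity},
        "annotations": {"summary": f"{expr} {op} {value} for {sustained_for}"},
    }

def ratio_module(name: str, numerator: str, denominator: str,
                 max_ratio: float, sustained_for: str = "10m") -> dict:
    """Expand a ratio spec (e.g. an error rate) into a Prometheus alerting rule."""
    return {
        "alert": name,
        "expr": f"({numerator}) / ({denominator}) > {max_ratio}",
        "for": sustained_for,
        "labels": {"severity": "page"},
        "annotations": {"summary": f"ratio above {max_ratio}"},
    }

rules = {"groups": [{"name": "payments", "rules": [
    threshold_module("HighLatencyP99",
                     "histogram_quantile(0.99, sum(rate(latency_bucket[5m])) by (le))",
                     ">", 0.25),
    ratio_module("HighErrorRatio",
                 "sum(rate(http_errors_total[5m]))",
                 "sum(rate(http_requests_total[5m]))", 0.01),
]}]}
print(yaml.safe_dump(rules, sort_keys=False))
```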

Key Takeaway

Regardless of the translation method chosen, ensure you’re well set up to iteratively improve the output once validation results come in.

Phase 3: Validation

Validating the translated alerts is a critical step to build confidence and assure teams that the new system will work as expected. The concept is simple, but the implementation can be quite challenging. Stripe applied the following methods:

  • Validated historical events with automated promtool (Tooling for the Prometheus monitoring system) unit tests.
  • Alerts that did not have historical events were validated by similar alerts with historical events.
  • Truly unique alerts underwent human review.

Automating Unit Tests

Automating unit tests enabled Stripe to establish a feedback system for improving the translation between systems. To automate this process, Stripe required access to the current system’s events and the new system’s (AMP) time series data. While they could have used the current system’s data, Stripe chose to operate in dual-write mode to eliminate errors introduced by converting the time series data; enabling dual-writing months before the validation stream commenced was a prerequisite for this approach. The translation layer provided an abstract syntax tree (AST) of the expression language used in the current system. Using this tree, they could precisely identify the time series, labels, and durations of data required for each historical alert. With this data, they could generate large batches of promtool unit tests runnable in bulk via Spark or locally for fast iteration. However, promtool had limitations on the maximum size of the input data. Stripe adopted two strategies to mitigate this. The first was minimizing the input data and leveraging promtool’s expanding notation. The second was creating a private fork with higher limits than the open source version.
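As a rough sketch of what generating these tests can look like (the series names, values, and rule file below are hypothetical; Stripe's actual tooling is not shown), each historical event becomes a promtool test case whose input series use the expanding notation mentioned above:

```python
# Sketch: turn one historical alert event into a promtool unit test.
# Real inputs would come from the AST (which series, labels, and durations the
# alert needs) and from the dual-written data in AMP.
import yaml

def make_promtool_test(alertname: str, series: str, values: str,
                       eval_time: str, exp_labels: dict) -> dict:
    return {
        "interval": "1m",
        "input_series": [{"series": series, "values": values}],
        "alert_rule_test": [{
            "eval_time": eval_time,
            "alertname": alertname,
            "exp_alerts": [{"exp_labels": exp_labels}],
        }],
    }

test_file = {
    "rule_files": ["translated_alerts.yml"],   # output of the translation step
    "evaluation_interval": "1m",
    "tests": [
        make_promtool_test(
            alertname="HighErrorRatio",
            series='http_errors_total{service="api"}',
            values="0+10x60",                  # promtool's expanding notation
            eval_time="30m",
            exp_labels={"severity": "page", "service": "api"},
        ),
    ],
}

with open("high_error_ratio_test.yml", "w") as f:
    yaml.safe_dump(test_file, f, sort_keys=False)
# Then run: promtool test rules high_error_ratio_test.yml
```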

The automated tests were scheduled to run daily for all available test cases. While there were numerous opportunities to improve the efficiency of running these tests, the duration for which the system would need to operate had to be taken into consideration. For this migration, that period spanned approximately four months, so Stripe opted for a more brute-force approach, re-running work rather than building a system that intelligently tracked the state of the tests. One key optimization they did implement was querying production data only once, as they did not have a separate read-only time series database and the test system’s load could potentially push query limits. There were other challenges encountered in this loop, but three are worth highlighting:

  • The translation tool that Stripe developed was continuously evolving based on feedback, necessitating the testing of every expression each time the tool was updated to avoid undesirable regressions.
  • Stripe teams were still maintaining their production systems, adjusting existing alerts and creating new ones.
  • The current system was not without flaws.

The first challenge was more of an advantage for Stripe. The more tests they added, the more aggressive they could be with their translation tooling fixes. Although the feedback loop was painful due to the test loop taking hours to run on a Spark cluster, it enabled Stripe’s teams to bound the time required to answer questions that exceeded human-scale analysis.

Team changes created another challenge. Initially, one might assume the change rate would be low enough to ignore, and indeed Stripe’s weekly change rate was less than 5%, averaging closer to 2%. However, there were spikes when Stripe’s teams handled incident remediations, and while low on a weekly basis, the changes were largely cumulative, resulting in a 10-20% change rate over a month and closer to 30% over three months. To keep up, Stripe’s automation tracked alert versions by the date and time of the alert event.

For the purposes of validating the translations, Stripe’s teams assumed the alerts in their legacy TSDB solution were correct. This was, however, not strictly true. Most teams accepted a level of false positives on their alerts to ensure a high recall rate for critical alerts. These false positives manifested as false negatives in the translated Prometheus alerts and could be caused by a number of expected behaviors, including occasionally delayed or dropped data. To address this, Stripe implemented a number of heuristics, with the final backstop being a human reviewer of the alert.

Alerts Missing Test Events

Stripe’s system revealed that nearly 70% of alerts did not have historical events during the period when both systems were operating concurrently. To validate these alerts, Stripe leveraged the AST to categorize alert expressions. Building an automated unit test system and incurring the cost to validate edge cases would not have been a worthwhile effort for this 70% of alerts. Analysis using the AST indicated fewer than 1,500 unique expression patterns, with test case coverage of greater than 93% of alerts.

The AST is a true tree data structure, and there is a vast body of mathematical theory on tree diffing and categorization. Only a fairly simple design was needed, though this depended on the distribution of Stripe’s alert expressions. The categorization largely ignored metric names, labels, and durations, and some comparison operators were grouped to reduce cardinality. The final number of categories was slightly larger than 1,200. With these categories and the validated test results, the team at Stripe was able to validate more than 94% of all alerts.
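A sketch of this kind of categorization (with hypothetical node shapes standing in for the real parse trees) normalizes away the parts that do not affect alert semantics, such as metric names, labels, durations, and literal thresholds, and fingerprints what remains:

```python
# Sketch: bucket alert expressions into categories by normalizing their ASTs.
# ASTs here are nested tuples; the shapes are hypothetical stand-ins.
from collections import Counter

# Comparison operators grouped to reduce cardinality.
OP_GROUPS = {">": "cmp", ">=": "cmp", "<": "cmp", "<=": "cmp", "==": "eq", "!=": "eq"}

def normalize(node):
    """Drop metric names, labels, durations, and literal values; keep structure."""
    kind = node[0]
    if kind == "metric":
        return ("metric",)                      # ignore name and labels
    if kind == "rate":
        return ("rate", normalize(node[1]))     # ignore the duration window
    if kind == "compare":
        return ("compare", OP_GROUPS.get(node[1], node[1]), normalize(node[2]))
    raise ValueError(f"unknown node kind: {kind}")

alerts = [
    ("compare", ">", ("rate", ("metric", "http_errors_total", {"service": "api"}), "5m"), 0.5),
    ("compare", "<", ("rate", ("metric", "queue_depth", {"shard": "7"}), "10m"), 100),
    ("compare", "==", ("metric", "job_success", {}), 0),
]

categories = Counter(normalize(a) for a in alerts)
# The first two alerts collapse into the same category; the third is distinct.
for shape, count in categories.items():
    print(count, shape)
```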

Manual Validation

Despite the extensive automation achieved, Stripe’s observability team realized that some alerts still required manual validation. While not surprising, this gave Stripe’s teams an opportunity to critically ask: “Does this unique alert design really need to be maintained?” Although Stripe did not quantify this aspect of the migration, anecdotal evidence suggests many teams replaced or discarded these alerts. Categorical consolidation of alert expressions was not a migration goal, but reflecting on the data, Stripe’s team believes they could significantly reduce the number of alert expressions in use and gain efficiencies in operating their systems.

Additional Lessons Learned in Validation

Some approaches that Stripe tried but that did not yield the expected results include statistical validation of time series data from alert expressions, and approximate unit tests built from fixed time windows around events.

The statistical approaches showed promise, but exceptional cases became more frequent as Stripe expanded, and the technique was estimated to be only 60% to 70% accurate. While automation that handles 6 or 7 out of 10 tasks does a great deal of heavy lifting, it proved challenging for Stripe to identify when it was truly correct. The major lesson learned is the need for high-precision results: teams are happy to review something that might be wrong and find out it’s right, but the reverse destroys trust in the system.

Unit tests leveraging prior events and querying input data from historical data in Stripe’s current system at fixed ranges (2, 5, 10, 30, and 60 minutes) seemed feasible in theory. In practice, identifying the input time series required true expression parsing to an abstract syntax tree; regular expressions and heuristics were insufficient, and their error rate and lack of precision ruined confidence in the approach. And once the abstract syntax tree was obtained, approximation was no longer necessary.

Phase 4: Humans

While validation was the most impactful workstream, migrating the mental models of Stripe’s user base had the most influence on how the migration was perceived across the company. When choosing education methods, there is a trade-off between scalable and personable approaches: scalable methods preserve the observability team’s sanity, whereas personable methods preserve users’ sanity, so both need to be employed.

Examples of very scalable education methods:

  • Documentation
  • Recorded Classes / Screencasts
  • Convention / Being Opinionated

Examples of more personable education methods:

  • White glove interactions
  • Small team class / Q&A
  • Office hours

Developer education is a very broad and complicated topic. Here are three key lessons Stripe’s team learned:

  1. Engage with your company’s technical writing / developer education organization to help with documentation and training, and make use of their expertise. They’re the experts on the subject, and you’ll need material on how to participate in the migration.
  2. Don’t fall into the trap of measuring progress as a strict percentage of teams migrated; otherwise you’ll be tempted to involve users earlier than you should. Don’t involve users until you’ve closed the translation, validation, and improved-translation feedback loop.
  3. Focus on leveraging personalized education methods. For instance, spending 1:1 time with individuals who will become experts in their organization has a multiplying effect, as they will go on to train others, creating a network of local experts over time.

Conclusion

The key takeaway is that a migration of this scale can be completed on a reasonable timeline by staying focused on closing the feedback loop as quickly as possible. The translation-to-validation feedback loop enables handling all of the low-hanging fruit before engaging with users. Over the course of a year, Stripe cut costs by an order of magnitude, almost doubled the MTS capacity available to end users, and doubled the number of alert rules that could be executed.

About the authors

Cody Rioux

Cody Rioux is a Staff Engineer on the Observability team at Stripe, where he contributed to migrating Stripe to Amazon Managed Service for Prometheus and deployed Mantis. Prior to this, he spent ~8 years working on infrastructure for building high-scale, real-time observability solutions at Netflix. Outside of work he spends his time reading, hiking, camping, and cooking.

Michael Cowgill

Michael Cowgill is a Staff Engineer at Stripe. In the past year he helped migrate Stripe to Amazon Managed Service for Prometheus. Prior to financial technology companies, Mike spent 9 years working in aerospace/defense, where he focused on radar and EO/IR signal processing. In his free time Mike enjoys surfing, traveling, and time with friends and family.

Hasan Tariq

Hasan Tariq is a Principal Solutions Architect with AWS. He helps Financial Services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions.

Ballu Singh

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.

Mateus Prado

Mateus Prado is a Senior Technical Account Manager at AWS. He is passionate about developing solutions to help customers be successful on AWS.