Establishing RPO and RTO Targets for Cloud Applications

Deciding how to protect and recover an application is often easier than deciding how quickly your business needs that application recovered. Establishing the correct recovery objectives at the application level is nonetheless a critical part of business continuity planning. This blog post is intended to help customers as they establish or reevaluate recovery targets, build a recovery plan, and determine how AWS services fit within that plan.

As a quick refresher, RTO stands for Recovery Time Objective and is a measure of how quickly after an outage an application must be available again. RPO, or Recovery Point Objective, refers to how much data loss your application can tolerate. Another way to think about RPO is to ask: how old can the data be when this application is recovered? Both RTO and RPO targets are measured in hours, minutes, or seconds, with lower numbers representing less downtime or less data loss. Within the context of a Business Continuity Plan, applications with similar RTO targets are grouped together in Tiers, with Tier 0 having the lowest RTO.

Data loss is measured from most recent backup (your recovery point) to the point of disaster. Downtime is measured from the point of disaster until the target is fully recovered and available for service.

Image 1: Data loss is measured from most recent backup to the point of disaster. Downtime is measured from the point of disaster until fully recovered and available for service.
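To make the two measurements concrete, here is a minimal Python sketch that computes data loss and downtime from three timestamps and compares them with a pair of targets. The timestamps, the targets, and the function name `measure_recovery` are hypothetical values chosen for illustration only.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup, disaster, recovered):
    """Return (data_loss, downtime) as timedeltas for one incident."""
    if not (last_backup <= disaster <= recovered):
        raise ValueError("expected last_backup <= disaster <= recovered")
    data_loss = disaster - last_backup   # most recent backup -> point of disaster
    downtime = recovered - disaster      # point of disaster -> back in service
    return data_loss, downtime


# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2024, 3, 1, 2, 0)
disaster = datetime(2024, 3, 1, 9, 30)
recovered = datetime(2024, 3, 1, 11, 0)

# Hypothetical targets for this application.
rpo_target = timedelta(hours=8)
rto_target = timedelta(hours=1)

data_loss, downtime = measure_recovery(last_backup, disaster, recovered)
print(f"Data loss: {data_loss} (RPO met: {data_loss <= rpo_target})")
print(f"Downtime:  {downtime} (RTO met: {downtime <= rto_target})")
```

In this made-up timeline the application loses 7.5 hours of data, which is within the 8-hour RPO, but is down for 1.5 hours, which misses the 1-hour RTO.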

When establishing these objectives, keep in mind that recovering an application in 15 minutes (RTO) with less than 1 minute of data loss (RPO) is great, but only if your application actually requires it. This is because having a lower target – faster recovery and/or less data loss – requires additional resources and configuration to accomplish, causing an increase in operational complexity and cost to implement and maintain. Therefore, RTO and RPO targets must be set on an application-by-application basis, and they must be evaluated against the added cost and complexity of the proposed target for each application.

Image 2: As RTO is reduced, the cost and complexity to achieve it goes up.

The preceding image shows how the Recovery Time Objective for an application, along with the acceptable recovery cost, directly influences the recovery options that can be considered. If there are no options when both constraints are applied for your application, then either the RTO needs to be increased, or the additional cost/complexity needs to be justified.
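The same filtering logic can be sketched in a few lines of Python. The recovery options, their RTO figures, and their costs below are placeholders invented for the example and do not reflect real AWS pricing or any specific recovery strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    rto_minutes: int      # how quickly this option can restore service
    monthly_cost: float   # illustrative figure, not real pricing


# Hypothetical options; names and numbers are placeholders only.
OPTIONS = [
    RecoveryOption("restore from backup", 1440, 100.0),
    RecoveryOption("standby environment, scaled down", 60, 900.0),
    RecoveryOption("fully duplicated environment", 5, 4000.0),
]


def viable_options(options, rto_target_minutes, max_monthly_cost):
    """Keep only the options that satisfy both the RTO and the cost constraint."""
    return [
        option
        for option in options
        if option.rto_minutes <= rto_target_minutes
        and option.monthly_cost <= max_monthly_cost
    ]


# A 2-hour RTO with a 1,000 budget leaves one option.
print([o.name for o in viable_options(OPTIONS, 120, 1000.0)])

# A 15-minute RTO with the same budget leaves none, so either the RTO
# must be relaxed or the higher cost must be justified.
print([o.name for o in viable_options(OPTIONS, 15, 1000.0)])
```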

Recovery Planning – Tier 0

In a traditional on-premises data center environment, customers are responsible for recovering applications along with all of their critical supporting infrastructure. Critical infrastructure is different for every customer’s environment, but typically includes network equipment like switches, routers, and firewalls, domain servers for authentication and DNS, OS deployment servers, hypervisors, and data protection infrastructure. These components must be fully recovered before application recovery can begin, hence they would have the lowest RTO and can be grouped together as Tier 0 within the recovery plan. How does Tier 0 recovery change for cloud environments, though?

With AWS, most of the above-mentioned components can be implemented in multiple regions for minimal additional cost and near-zero RTO. The physical network infrastructure in the above example is managed by AWS, while the network configuration is defined within your VPC in each region. OS deployments are enabled via Amazon Machine Images (AMIs). These include Amazon-managed AMIs, custom AMIs that you have created, and community-provided AMIs. Hypervisors for EC2 are pre-built and managed by AWS, requiring no customer actions to be ready for instance deployment. Data protection in AWS can be achieved in a few different ways; one example is AWS Backup, a fully managed service that centralizes and automates data protection across AWS services and replicates those backups to alternate region(s). DNS can be provided by Amazon Route 53, or it can be provided along with domain authentication by AWS Directory Service. Note that recovery options for AWS Directory Service vary in cost and recovery time based on which edition of Directory Service you select.

Once the core infrastructure has been identified and grouped within Tier 0, establishing RPO and RTO targets for your applications can begin. As you analyze your applications, watch for any additional prerequisite components or critical infrastructure that need to be included in Tier 0.

Tier 1 and beyond

Establishing the remaining Tiers in your recovery plan starts with determining RTO targets for each of your applications, and then grouping applications with similar targets together. Working with the business owner of an application, analyze the risks of downtime and what investment in money, time, and effort is appropriate for your business. While not an exhaustive list, here are a few questions to help you get started during this process:

RTO Questions:

What is the impact if this application is unavailable? Does this impact change over time?
Is there a financial cost? How much per hour?
Is there a reputational cost? For example, a public-facing corporate landing page.
Does this application have an SLA with internal or external customers?
Are there external compliance or regulatory requirements this application is subject to?
Does this application depend on any other applications? Do you know of any applications that depend on this application?

Many of the RTO questions can also be asked when determining an RPO target for an application:

RPO Questions:

What is the impact of data loss for this application?
Is there a financial cost? How much per hour?
Is there a reputational cost?
Does this application have an SLA with internal or external customers?
Are there external compliance or regulatory requirements this application is subject to?
Do other applications depend on data from this application? What are their RPO requirements?
Can lost data be recreated? How long would this take, and is this acceptable?
How often does the data change?
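Once answers to questions like these have produced an RTO target for each application, grouping applications into Tiers is a simple bucketing exercise. The sketch below assumes hypothetical tier boundaries and application names; your own plan will define its own cut-offs, and Tier 0 is left out because it is reserved for core infrastructure.

```python
from collections import defaultdict

# Hypothetical tier boundaries: (tier name, maximum RTO in minutes).
TIER_LIMITS = [
    ("Tier 1", 60),
    ("Tier 2", 240),
    ("Tier 3", 1440),
]
FALLBACK_TIER = "Tier 4"  # anything slower than the last boundary


def assign_tier(rto_minutes):
    """Return the first tier whose RTO limit covers the given target."""
    for tier, limit in TIER_LIMITS:
        if rto_minutes <= limit:
            return tier
    return FALLBACK_TIER


def group_by_tier(app_rto_minutes):
    """Map each tier name to the alphabetically sorted apps assigned to it."""
    tiers = defaultdict(list)
    for app, rto in sorted(app_rto_minutes.items()):
        tiers[assign_tier(rto)].append(app)
    return dict(tiers)


# Hypothetical applications and their agreed RTO targets in minutes.
apps = {
    "payments-api": 30,
    "corporate-site": 120,
    "reporting": 1440,
    "archive-search": 4320,
}

tiers = group_by_tier(apps)
for tier in sorted(tiers):
    print(tier, tiers[tier])
```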

AWS Resilience Hub

AWS Resilience Hub was recently released to help customers establish RPO/RTO targets per application, and then analyze applications against those targets. RPO and RTO objectives are defined in Resiliency Policies within Resilience Hub, either by selecting from a list of predefined policies as shown below, or by creating a custom policy based on your business needs.

Image 3: AWS Resilience Hub contains suggested resiliency policies, each with unique RTO and RPO targets.

Resiliency Policies are assigned to one or more applications, creating a Tier. Applications are then assessed against their Tier’s targets via a direct request from a user, via a scheduled assessment, or as part of your CI/CD pipeline. Following this assessment, Resilience Hub provides a breakdown of which individual components within your application meet, exceed, or fall short of the targeted objective. Resilience Hub then provides recommendations on how to remediate those components to bring them in line with the policy, along with the estimated resulting RTO, RPO, and costs for each remediation option, as shown below.

Image 4: AWS Resilience Hub resiliency recommendations.
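The idea of checking each component against a policy can be illustrated with a small Python sketch. This is not the Resilience Hub API and does not reproduce how the service performs its assessment; the `Targets` class, the component names, and all of the numbers are hypothetical and exist only to show the meet, exceed, or fall short comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    rto_seconds: int
    rpo_seconds: int


def classify(estimated, target):
    """Compare an estimated value with its target (lower is better)."""
    if estimated < target:
        return "exceeds"
    if estimated == target:
        return "meets"
    return "falls short"


def assess(components, policy):
    """Return a per-component verdict for both RTO and RPO."""
    return {
        name: {
            "rto": classify(estimate.rto_seconds, policy.rto_seconds),
            "rpo": classify(estimate.rpo_seconds, policy.rpo_seconds),
        }
        for name, estimate in components.items()
    }


# Hypothetical policy: 4-hour RTO, 1-hour RPO.
policy = Targets(rto_seconds=4 * 3600, rpo_seconds=3600)

# Hypothetical estimated recovery figures per component.
components = {
    "database": Targets(rto_seconds=1800, rpo_seconds=300),
    "app-servers": Targets(rto_seconds=4 * 3600, rpo_seconds=3600),
    "file-storage": Targets(rto_seconds=8 * 3600, rpo_seconds=24 * 3600),
}

report = assess(components, policy)
for name, verdict in report.items():
    print(name, verdict)
```

Components that fall short on either measure are the ones that would need remediation, or a relaxed policy, before the application as a whole can meet its Tier's targets.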

To learn more about how Resilience Hub can help you establish and maintain your Business Continuity posture in AWS, see the AWS Resilience Hub documentation.

Conclusion

Determining appropriate RPO and RTO targets for each of your applications is a critical but often overlooked exercise. When working to establish or reevaluate these targets, it helps to discuss the potential impact of delayed recovery times and data loss with the application business owners. Asking some of the questions mentioned above can help you identify RPO and RTO targets that can then be evaluated against the cost and complexity to implement.

When it comes to establishing recovery objectives for applications in the cloud, AWS services can be configured to enable rapid cross-region recovery of Tier 0 with minimal additional costs. AWS Resilience Hub can be used to build and assign recovery objectives for your applications and then assess the ability of each application and its components to meet those objectives, providing remediation recommendations where required.

AWS Cloud Operations Blog