AWS Storage Blog
Enhance business continuity within an Availability Zone using AWS Elastic Disaster Recovery
At Amazon Web Services (AWS), we recommend running workloads across multiple Availability Zones (AZs) for high availability and fault tolerance. However, there are certain situations where users need to run their workloads in a single AZ. These include legacy or commercial off-the-shelf (COTS) applications that don’t support deployments across multiple AZs, workloads that have low-latency requirements, non-production environments that are optimized for cost, and workloads governed by regulations to run within a single physical location. In all of these situations, we must be prepared to respond to the unlikely event that a single AZ experiences an outage. How would we maintain business continuity?
AWS Elastic Disaster Recovery enables you to improve your resilience posture by replicating your workloads across multiple Availability Zones within the same Region. With this approach, in the event of an Availability Zone impairment or outage, your applications can fail over seamlessly, minimizing downtime and helping you achieve your Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
In this post, we show you how to implement a cross-Availability Zone (AZ) disaster recovery (DR) strategy. By using Elastic Disaster Recovery, you can continuously replicate data from your primary AZ to a secondary AZ and recover your applications during both planned and unplanned outages.
Solution overview
In this post, we reference a scenario where a web application resides in a single Availability Zone (AZ) and provides important business functionality, with recovery objectives of one minute for RPO and one hour for RTO. We use Elastic Disaster Recovery to protect this application, hosted in primary Availability Zone 1, and allow it to seamlessly fail over to a recovery subnet hosted in Availability Zone 3 during an outage. After failing over, we then use Elastic Disaster Recovery to protect the recovered instance in Availability Zone 3, mitigating the risk of further outages or failure.
Figure 1: Cross-Availability Zone DR solution architecture using Elastic Disaster Recovery
The cross-Availability Zone disaster recovery solution flow works as follows:
- An Amazon Elastic Compute Cloud (Amazon EC2) web application instance runs in the primary Virtual Private Cloud (VPC) called blog-primary-vpc. This instance represents the web application in a single Availability Zone (Availability Zone 1).
- Elastic Disaster Recovery protects the web application by performing block-level replication through a replication agent installed on the source server. Data is sent directly from the source server to a replication server in the staging area VPC called blog-staging-vpc.
- The staging area uses lightweight EC2 instances for the replication servers, along with identically sized Amazon Elastic Block Store (Amazon EBS) volumes for staging.
- When recovery is initiated, Elastic Disaster Recovery automatically launches the web application in the recovery VPC called blog-recovery-vpc based on the configured launch settings.
- VPC peering keeps data replication traffic within the AWS global infrastructure, which prevents data from traversing the public internet.
- The Elastic Disaster Recovery control plane allows you to configure replication settings, initiate recovery, and protect the recovered instance in another Availability Zone. Furthermore, VPC Endpoints are used to keep all communications within the AWS global infrastructure.
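The flow above depends on two VPC peering connections, and VPC peering cannot be established between VPCs whose CIDR blocks overlap. The following is a minimal sketch for checking a planned layout before you build it. The CIDR blocks are hypothetical, because this post does not specify them, and the function name is ours.

```python
import ipaddress

# Hypothetical CIDR blocks for the three VPCs named in this post.
VPC_CIDRS = {
    "blog-primary-vpc": "10.0.0.0/16",
    "blog-staging-vpc": "10.1.0.0/16",
    "blog-recovery-vpc": "10.2.0.0/16",
}

# The two peering connections used by the solution.
PEERINGS = [
    ("blog-primary-vpc", "blog-staging-vpc"),
    ("blog-recovery-vpc", "blog-staging-vpc"),
]


def conflicting_peerings(cidrs, peerings):
    """Return the peering pairs whose VPC CIDR blocks overlap."""
    return [
        (a, b)
        for a, b in peerings
        if ipaddress.ip_network(cidrs[a]).overlaps(ipaddress.ip_network(cidrs[b]))
    ]
```

An empty result means that neither planned peering connection has an overlapping address range.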
Prerequisites
Before proceeding with this walkthrough, make sure you have the following prerequisites:
- An AWS account.
- An Amazon EC2 key pair (required for instance authentication). For more details, refer to Amazon EC2 key pairs.
- A primary VPC with a public subnet and an Internet Gateway (IGW).
- A staging VPC with a private subnet and VPC Endpoints.
- A recovery VPC with a public subnet and an IGW.
- Two VPC peering connections: one between the primary and staging VPCs, and another between the recovery and staging VPCs.
- An AWS Identity and Access Management (IAM) user with the following managed policies:
- Make sure you have the access key ID and secret access key for this IAM user, as these credentials are needed for the AWS Replication Agent installation. For more details on creating an IAM user, refer to creating an IAM user in your AWS account.
- For production environments, we recommend using temporary credentials. Refer to Securely installing AWS Replication Agent using AWS Security Token Service for more details.
- Make sure that the security group for your VPC Endpoints allows inbound TCP access on ports 443 and 1500.
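As an illustration of the last prerequisite, the following sketch builds the inbound rules for TCP ports 443 and 1500 in the `IpPermissions` shape that the Amazon EC2 `AuthorizeSecurityGroupIngress` API accepts. The source CIDR is a hypothetical placeholder and the helper name is ours. Adjust both to your environment.

```python
# TCP ports named in the prerequisite above.
ENDPOINT_PORTS = (443, 1500)


def endpoint_ingress_rules(source_cidr, ports=ENDPOINT_PORTS):
    """Build one inbound TCP rule per port for the given source CIDR."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [
                {"CidrIp": source_cidr, "Description": f"DRS TCP {port}"}
            ],
        }
        for port in ports
    ]
```

With boto3 installed and credentials configured, you could pass the result to `boto3.client("ec2").authorize_security_group_ingress(GroupId=..., IpPermissions=rules)`, supplying your own security group ID.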
If you don’t have an available environment, then refer to this GitHub repository for a sample AWS CloudFormation template to simulate the architecture defined in this post. The template is provided as-is and isn’t officially supported by AWS.
Walkthrough
The steps to deploy cross-Availability Zone DR can be summarized in the following order:
1. Set up Elastic Disaster Recovery
2. Install the AWS Replication Agent on the web server
3. Initiate a recovery for the web application
4. Validate the recovered instance
5. Protect the recovered web application
1. Set up Elastic Disaster Recovery
1.1. If you’re initializing Elastic Disaster Recovery for the first time, then follow the steps in the DRS User Guide. If you have previously initialized Elastic Disaster Recovery, then update the default replication and launch settings with the following input parameters. On the Settings: Default replication page of the Elastic Disaster Recovery console, edit the Replication server configuration and set the staging area subnet to your staging private subnet (such as ‘blog-staging-private-subnet’), as shown in Figure 2. Keep the Replication server instance type as ‘t3.small’.
Figure 2: Replication server configuration for Elastic Disaster Recovery
1.2. In Security groups, check the box Always use AWS Elastic Disaster Recovery security group, and from the dropdown choose the Elastic Disaster Recovery Interface Endpoint security group that was created as part of the CloudFormation deployment, as shown in Figure 3.
Figure 3: Security group configuration for Elastic Disaster Recovery
1.3. In the Data routing and throttling section, check the option Use private IP for data replication (VPN, DirectConnect, VPC peering), as shown in Figure 4.
Figure 4: Data routing and throttling configuration for Elastic Disaster Recovery
1.4. Leave the default settings for Volumes and Point in time (PIT) policy, then choose Save changes.
1.5. On the Settings: Default launch page of the Elastic Disaster Recovery console, leave the default DRS launch settings.
1.6. Edit the default EC2 launch template. In the Basic settings section, set the subnet for the launched instance to your recovery public subnet (such as ‘blog-recovery-public-subnet’) and the security group to the recovery security group (which allows access to the recovered web server on port 80). Leave the defaults for the remaining settings and save changes, as shown in Figure 5.
Figure 5: Launch template configuration for Elastic Disaster Recovery
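If you prefer to script the replication settings from steps 1.1 through 1.3, the following sketch collects them as keyword arguments for the Elastic Disaster Recovery `UpdateReplicationConfigurationTemplate` API. The IDs are placeholders that you supply. We assume here that the console checkbox in step 1.2 corresponds to the `associateDefaultSecurityGroup` parameter, so verify the mapping against the API reference before relying on it.

```python
def replication_template_settings(template_id, staging_subnet_id, endpoint_sg_id):
    """Collect the step 1.1 to 1.3 choices as API keyword arguments."""
    return {
        "replicationConfigurationTemplateID": template_id,
        # Step 1.1: staging area subnet and replication server size.
        "stagingAreaSubnetId": staging_subnet_id,
        "replicationServerInstanceType": "t3.small",
        # Step 1.2: assumed mapping of the console checkbox, plus the
        # interface endpoint security group.
        "associateDefaultSecurityGroup": True,
        "replicationServersSecurityGroupsIDs": [endpoint_sg_id],
        # Step 1.3: replicate over private IP addresses.
        "dataPlaneRouting": "PRIVATE_IP",
    }
```

With boto3, the dictionary can be unpacked into `boto3.client("drs").update_replication_configuration_template(**settings)`.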
2. Install the AWS Replication Agent on the web server to add it as a source server
2.1. Connect to the web server (through ‘EC2 Instance Connect’ or an SSH client).
2.2. In the terminal, enter the following to download the replication agent. This command references the ap-southeast-2 Region, so you must update it with the Region into which you’re replicating:
wget -O ./aws-replication-installer-init.py https://aws-elastic-disaster-recovery-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/latest/linux/aws-replication-installer-init.py
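Because the Region appears twice in the download URL, it is easy to update one occurrence and miss the other. The following small helper builds the URL from a single Region value. It assumes that other Regions follow the same pattern as the ap-southeast-2 command above, so confirm the URL for your Region in the DRS User Guide.

```python
def installer_url(region):
    """Build the Linux installer URL using the pattern from step 2.2."""
    return (
        f"https://aws-elastic-disaster-recovery-{region}"
        f".s3.{region}.amazonaws.com"
        "/latest/linux/aws-replication-installer-init.py"
    )
```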
2.3. Install the replication agent by running the following:
sudo python3 aws-replication-installer-init.py
When prompted, enter the Region for your Elastic Disaster Recovery environment and the access key ID and secret access key for your IAM user. These values are part of the CloudFormation template creation output if you use the sample template provided. The following screenshot shows the output for a successful installation of the AWS Replication Agent.
Figure 6: Command line output for the AWS Replication Agent installation on the web server application
2.4. After the replication agent has been successfully installed, you should observe the web server added as a source server in the Elastic Disaster Recovery console. In this case, you’re replicating the web server from the primary Availability Zone 1 (ap-southeast-2a) to staging Availability Zone 2 (ap-southeast-2b) in the Sydney Region. When replication completes, the source server will show ‘Ready’ under Ready for recovery and ‘Healthy’ for Data replication status, as shown in Figure 7.
Figure 7: Web server under source servers of the Elastic Disaster Recovery console
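You can make the same check outside the console by inspecting the response of the Elastic Disaster Recovery `DescribeSourceServers` API. The sketch below treats a `dataReplicationState` of `CONTINUOUS` as a rough proxy for the ‘Healthy’ status in Figure 7. That mapping is our assumption, and the helper names are ours.

```python
def replication_states(response):
    """Map each source server ID to its data replication state."""
    return {
        server.get("sourceServerID"): server.get("dataReplicationInfo", {}).get(
            "dataReplicationState"
        )
        for server in response.get("items", [])
    }


def all_continuous(response):
    """True when at least one server exists and all replicate continuously."""
    states = replication_states(response)
    return bool(states) and all(state == "CONTINUOUS" for state in states.values())
```

With boto3, the response could come from `boto3.client("drs").describe_source_servers(filters={})`.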
3. Initiate a recovery
3.1. Now that the source server is ready for recovery, you can choose the Amazon EC2 web server hostname and initiate a recovery job by choosing the Initiate recovery action. You can also initiate a recovery drill to test the failover process, as shown in Figure 8.
Figure 8: Initiation of a recovery in the Elastic Disaster Recovery console
3.2. Choose a point in time from which to launch the recovery instance for the source server, as shown in Figure 9. This allows you to recover the server from a specific state.
Figure 9: Choose a point in time from which to initiate the recovery job
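The choices in steps 3.1 and 3.2 (drill or recovery, and which point in time) map onto the parameters of the Elastic Disaster Recovery `StartRecovery` API. The following sketch builds that request. The helper name is ours, the IDs are placeholders, and it defaults to a drill so that an accidental run does not start a real recovery.

```python
def start_recovery_request(source_server_id, drill=True, snapshot_id=None):
    """Build StartRecovery keyword arguments for one source server.

    Leaving snapshot_id as None omits recoverySnapshotID, which asks the
    service to use the latest recovery snapshot.
    """
    server = {"sourceServerID": source_server_id}
    if snapshot_id is not None:
        server["recoverySnapshotID"] = snapshot_id
    return {"isDrill": drill, "sourceServers": [server]}
```

With boto3, you could unpack the result into `boto3.client("drs").start_recovery(**request)` and read the job ID from the `job` element of the response.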
3.3. Navigate to the Recovery job history page, where you can view each recovery job, track its progress, and confirm that the recovery succeeded, as shown in Figure 10.
Figure 10: Recovery job history in the Elastic Disaster Recovery console
3.4. Once the recovery is complete, a new EC2 instance is created for the web server. Choose the instance ID in the Elastic Disaster Recovery console to view more details about the recovered instance, including its configuration and operational status.
Figure 11: Web server recovered instance in the Elastic Disaster Recovery console
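To track a recovery job programmatically instead of watching the Recovery job history page, you can evaluate a job item returned by the Elastic Disaster Recovery `DescribeJobs` API. This is a minimal sketch with helper names of our own. It counts a job as successful only when the job is `COMPLETED` and every participating server reports `LAUNCHED`.

```python
def job_succeeded(job):
    """True when the job completed and every participating server launched."""
    servers = job.get("participatingServers", [])
    return (
        job.get("status") == "COMPLETED"
        and bool(servers)
        and all(server.get("launchStatus") == "LAUNCHED" for server in servers)
    )


def recovery_instance_ids(job):
    """Collect the recovery instance IDs reported by the job, if any."""
    return [
        server["recoveryInstanceID"]
        for server in job.get("participatingServers", [])
        if server.get("recoveryInstanceID")
    ]
```

With boto3, a job item could be read from `boto3.client("drs").describe_jobs(filters={"jobIDs": [job_id]})["items"]`.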
4. Validate the recovered instance
Now that the source web server has been recovered, validate that it’s functional. For the recovered web server, access the public IP address on port 80 through your web browser. You should observe the ‘Welcome to nginx!’ message as shown in Figure 12:
Figure 12: NGINX home page of the recovered web server instance
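The browser check can also be scripted, which is useful if you run recovery drills regularly. The following sketch uses only the Python standard library, with function names of our own. It assumes that the recovered server serves the default NGINX page over plain HTTP on port 80, as in this walkthrough.

```python
from urllib.request import urlopen

# Text that the default NGINX landing page contains.
NGINX_MARKER = "Welcome to nginx!"


def is_nginx_welcome(body):
    """True when the page body contains the default NGINX greeting."""
    return NGINX_MARKER in body


def check_web_server(public_ip, timeout=5.0):
    """Fetch http://<public_ip>:80/ and confirm the NGINX welcome page."""
    with urlopen(f"http://{public_ip}:80/", timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status == 200 and is_nginx_welcome(body)
```

`check_web_server` raises an exception from `urllib.error` if the instance is unreachable or returns an HTTP error, so wrap the call in a `try` block if you want to retry while the instance finishes booting.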
5. Protect the recovered web application
5.1. Now that you have successfully launched the recovery instance in the target Availability Zone 3, you must protect this instance, because it is now the primary web application. Under Recovery instances, the newly recovered web server shows a pending action of Protect recovered instance, indicating that it is currently not protected against any subsequent outages, as shown in Figure 13.
Figure 13: Pending actions of ‘Protect recovered instance’ in the Elastic Disaster Recovery console
5.2. The replication settings remain the same, because you’re using the same staging area subnet (blog-staging-private-subnet).
5.3. Create a new version of the EC2 launch template for the source web server by navigating to Source servers in the Elastic Disaster Recovery console and choosing Edit EC2 launch template from the Actions drop down, as shown in Figure 14.
Figure 14: Drop down action to edit the EC2 launch template for the source server in the Elastic Disaster Recovery console
5.4. Update the EC2 launch template for the source web server so that it launches a recovery instance in the ‘blog-primary-public-subnet’ instead of the current ‘blog-recovery-public-subnet’, as shown in Figure 15. Furthermore, set the Security groups to the security group created as part of the CloudFormation deployment. Choose Update template to save changes.
Figure 15: Network settings for the launch template in the Elastic Disaster Recovery console
5.5. Once the EC2 launch template is updated, go back to the Recovery instances tab and choose the recovered web server. Choose Protect recovered instance from the Replication drop down to enable protection, as shown in Figure 16.
Figure 16: Drop down option to protect the recovered instance in the Elastic Disaster Recovery console
5.6. The recovered web server instance now has a pending action of ‘Initiate recovery or drill on source server.’ On the source server, observe the ‘rescanning’ status. A rescan detects and applies changes since the last snapshot, rather than doing a full replication of the source disk, as shown in Figure 17.
Figure 17: Rescanning progress of the recovered instance in the Elastic Disaster Recovery console
5.7. Once the rescan is complete, the recovery server becomes the new source server. Confirm by checking that the instance ID in the source server host name is the same as the recovered instance ID. You can initiate a recovery or a drill from the source server in the ‘blog-recovery-public-subnet’ to launch a recovery instance in the ‘blog-primary-public-subnet’. Repeat the preceding steps to set up a cross-Availability Zone recovery after another failover event.
Figure 18: New source server is now ready for recovery in the Elastic Disaster Recovery console
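Each failover swaps the roles of the two subnets: the subnet that received the recovery instance becomes the source, and the launch template must point back at the other one. The following sketch models that alternation so that you can keep track of the direction after repeated failovers. The class is purely illustrative and is not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrDirection:
    """Where the protected server runs and where a recovery would launch."""

    source_subnet: str  # subnet of the current source server
    launch_subnet: str  # subnet set in the EC2 launch template

    def after_failover(self):
        """Return the direction to configure once a failover has completed."""
        return DrDirection(
            source_subnet=self.launch_subnet,
            launch_subnet=self.source_subnet,
        )


# Direction at the start of this walkthrough.
INITIAL = DrDirection("blog-primary-public-subnet", "blog-recovery-public-subnet")
```

After the failover in this post, `INITIAL.after_failover()` describes the state reached in step 5.7: the source runs in ‘blog-recovery-public-subnet’ and the launch template targets ‘blog-primary-public-subnet’.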
Cleaning up
To avoid unexpected charges, clean up any resources that were created while setting up your DR environment. This includes terminating recovery instances and disconnecting the source server from Elastic Disaster Recovery (which removes the AWS resources that were created to enable replication).
Follow the steps in the Amazon EC2 user guide to terminate an instance, delete an EBS volume, and delete an EBS snapshot. If you used the CloudFormation template to set up the environment, then don’t forget to delete the CloudFormation stack to make sure the resources are properly cleaned up. By taking these steps, you can avoid incurring future charges.
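If you script the cleanup of the Elastic Disaster Recovery resources, order matters, because a source server must be disconnected before it can be deleted. The following sketch lists the Elastic Disaster Recovery API calls in a workable order and applies them to a client that you pass in. The helper names are ours and the IDs are placeholders. Review the plan against your own environment before running it, because these calls are destructive.

```python
def cleanup_plan(source_server_id, recovery_instance_ids):
    """Return (method name, keyword arguments) pairs in execution order."""
    plan = []
    if recovery_instance_ids:
        plan.append(
            (
                "terminate_recovery_instances",
                {"recoveryInstanceIDs": list(recovery_instance_ids)},
            )
        )
    # Disconnect first, then delete the source server record.
    plan.append(("disconnect_source_server", {"sourceServerID": source_server_id}))
    plan.append(("delete_source_server", {"sourceServerID": source_server_id}))
    return plan


def run_plan(client, plan):
    """Call each planned method on the given client, in order."""
    for method_name, kwargs in plan:
        getattr(client, method_name)(**kwargs)
```

With boto3, `client` would be `boto3.client("drs")`. Terminating recovery instances starts an asynchronous job, so you may need to wait for it to finish before deleting the remaining resources.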
Conclusion
AWS recommends that organizations deploy across multiple Availability Zones (AZs) in order to meet business continuity requirements. In certain situations, an application can only run in a single AZ. This could be the result of legacy or COTS application constraints, requirements for low-latency performance, a desire to minimize costs, or regulatory mandates.
In this post, we demonstrated how to implement disaster recovery for a critical web application running in a single Availability Zone (AZ) using Elastic Disaster Recovery. We started by initializing the Elastic Disaster Recovery service, followed by the installation of the Elastic Disaster Recovery agent. Once this was complete, we were able to achieve continuous data protection between the web application and the replication server. We then initiated a recovery to the target AZ, in response to an outage. Once the server was restored, we validated the recovery of our web application, and proceeded to protect the recovered instance.
AWS Elastic Disaster Recovery can enhance business continuity by providing cross-Availability Zone (AZ) disaster recovery to minimize data loss and disruption for applications running in a single AZ. With Elastic Disaster Recovery, customers can achieve an RTO measured in minutes and an RPO measured in seconds. To get started on improving your DR strategy and minimizing risks, refer to the AWS getting started documentation for step-by-step guidance on implementing fast, reliable recovery for both cloud-based and on-premises applications. To stay up-to-date on Elastic Disaster Recovery, check out our posts on the AWS Storage Blog channel.