A Pilot Light disaster recovery strategy for WordPress

In today’s digital ecosystem, maintaining an uninterrupted and resilient online presence is essential for businesses. WordPress platforms, whether e-commerce sites or news portals, must not only meet but exceed stringent Service Level Agreements (SLAs) to maintain user trust, ensure continuity, and protect revenue. These SLAs, defining Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), set the standards for acceptable downtime and data loss. Traditional backup and restore methods often fall short of these requirements, especially when a workload is compromised and could benefit from running in an alternate AWS Region. A robust disaster recovery (DR) strategy, achievable through a multi-Region approach, is necessary.

WordPress is a popular, open-source content management system (CMS) known for its flexibility and ease of use. It powers a diverse range of websites, from small blogs to large corporate sites, offering extensive customization options through themes and plugins. Its user-friendly interface allows individuals without technical expertise to create, manage, and publish digital content efficiently. WordPress’s vast community contributes to its robust selection of tools and resources, making it a leading choice for web development and content management worldwide.

A multi-Region DR plan is crucial for WordPress deployments because cyber-attacks, system failures, and natural disasters are unpredictable. This strategy not only meets rigorous RTO and RPO requirements but also provides operational resilience that can adapt to digital threats and challenges. By spreading critical workloads across AWS Regions, businesses can swiftly shift operations from a compromised primary Region to another, enhancing resilience and making sure that SLAs are not just met but exceeded, even during performance bottlenecks or spikes in demand. In this article, we dive into how you can architect a resilient cross-Region Pilot Light DR strategy that uses the robust global infrastructure of AWS.

Pilot Light explained

As outlined in the post, Disaster Recovery (DR) Architecture on AWS, Part III: Pilot Light and Warm Standby, an optimal DR strategy balances the benefits of a lower RTO and RPO, and the cost of implementing and operating the strategy. RTO is the targeted duration to restore a business process or IT service after a disaster. It defines the acceptable amount of downtime before the impact becomes unacceptable. RPO is the maximum acceptable age of files or data in backup storage necessary to resume normal operations after a failure. It determines the data loss tolerance in terms of time.
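To make the two objectives concrete, the following minimal sketch checks one recovery event against an RTO and an RPO. The function name, the timestamps, and the 15-minute and 1-minute targets are illustrative assumptions, not values from this architecture.

```python
from datetime import datetime, timedelta


def meets_objectives(outage_start, service_restored, last_replicated_write,
                     rto, rpo):
    """Return (rto_met, rpo_met) for a single recovery event."""
    # Downtime is measured from the start of the outage to restored service.
    downtime = service_restored - outage_start
    # Data loss is the gap between the last replicated write and the outage.
    data_loss_window = outage_start - last_replicated_write
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical drill: service back after 12 minutes, 45 seconds of writes lost.
outage = datetime(2024, 1, 1, 12, 0, 0)
restored = outage + timedelta(minutes=12)
last_write = outage - timedelta(seconds=45)

print(meets_objectives(outage, restored, last_write,
                       rto=timedelta(minutes=15), rpo=timedelta(minutes=1)))
# prints (True, True)
```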

Figure 1 – recovery point/recovery time objectives for different DR strategies

The Pilot Light strategy is a DR approach in which a minimal version of an environment is always running in a recovery Region, as shown in Figure 1. It resembles a pilot light that can quickly ignite the full system, enabling rapid scaling up to a full production environment in case of a disaster. This is done by keeping data services live, such as data replication from one AWS Region to another, while compute services are provisioned or scaled up only during a recovery event.

Data management for WordPress and challenges

There are three challenges for implementing a cross-Region DR strategy for WordPress:

Database: WordPress primarily uses a MySQL or MariaDB database to store its data. This includes posts, pages, comments, user information, site settings, and metadata. Implementing a cross-Region DR strategy requires a backup database in the second Region to be kept in sync with the primary database.
File System Storage: In addition to the database, WordPress uses the server’s file system to store media files (such as images, videos, and documents), plugins, themes, and core WordPress files. The ‘wp-content‘ directory is particularly important, as it contains themes, plugins, static content, and uploads. Files in the ‘wp-content‘ directory need to be synchronized between the primary site and the DR site. This includes media files, themes, and plugins.
WordPress configuration: WordPress stores its configuration files on the filesystem and these configurations may need to be updated during a recovery event.

The strategy

This strategy assumes that you already have a WordPress site running on AWS using the Well-Architected reference architecture for WordPress, as explained in more detail in this Amazon white paper.

Figure 2 – Pilot Light DR architecture for WordPress

The architecture for the Pilot Light DR strategy as shown in Figure 2 uses the following capabilities on AWS:

Multi-Region File System Replication using Amazon Elastic File System (EFS) Replication

Amazon Elastic File System (EFS) provides serverless, fully managed, and scalable file storage that lets users share file data without provisioning or managing storage capacity and performance. For the WordPress deployment, static content such as media files, plugins, and themes reside on an Amazon EFS file system, allowing these files to be shared across multiple Amazon Elastic Compute Cloud (EC2) instances serving the WordPress application. Amazon EFS Replication is used to replicate file system data from the primary Region to the backup Region without requiring additional infrastructure or custom processes. EFS Replication is continuous and helps provide a RPO and RTO of minutes, helping users meet compliance and business continuity goals.

When using EFS Replication for this DR solution, Amazon EFS asynchronously replicates data from the primary filesystem to a secondary filesystem in the DR Region. During a failover event, the replication is paused and the EC2 instances in the recovery Region mount the backup Amazon EFS filesystem. EFS Replication also supports failback, making it easier and more cost-effective to synchronize changes between Amazon EFS file systems after a failover event. Note that since EFS Replication is asynchronous, there is a risk of losing a few minutes’ worth of data in the event of a failover.
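The replication lifecycle described above could be driven with the AWS SDK for Python (boto3) roughly as follows. This is a sketch under assumptions: the helper names and the file system ID are hypothetical, and we assume that "pausing" replication at failover maps to deleting the replication configuration, which makes the destination file system writable. The client is passed in so the helpers can be exercised without AWS credentials.

```python
def start_replication(efs_client, source_fs_id, dr_region):
    """Begin continuous replication to a file system in the DR Region."""
    return efs_client.create_replication_configuration(
        SourceFileSystemId=source_fs_id,
        Destinations=[{"Region": dr_region}],
    )


def fail_over(efs_client, source_fs_id):
    """Stop replication so the DR copy can accept writes (assumed mapping)."""
    return efs_client.delete_replication_configuration(
        SourceFileSystemId=source_fs_id
    )


# Wiring with a real client (requires boto3 and AWS credentials):
#   import boto3
#   efs = boto3.client("efs", region_name="us-east-1")
#   start_replication(efs, "fs-0123456789abcdef0", "us-west-2")
```

Passing the client as a parameter keeps the helpers easy to test and lets the failover automation decide which Region's endpoint to call.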

Amazon Aurora Global Database

Amazon Aurora Global Database is a cornerstone for cross-Region Pilot Light DR plans due to its comprehensive features and capabilities. Amazon Aurora is a fully managed database service compatible with MySQL (used by WordPress), offering automatic storage scaling, replication across Availability Zones (AZs), and continuous backup to Amazon Simple Storage Service (S3) for enhanced reliability and ease of management. This strategy uses Aurora Global Database, a feature that lets a single Aurora database span multiple AWS Regions, which is crucial for maintaining data integrity and resilience. Aurora Global Database offers a unified database endpoint, so the same endpoint can be used for both the primary and DR sites, which streamlines failover by eliminating the need for endpoint reconfiguration. Aurora Global Database replicates asynchronously, with latency typically under one second between the primary and secondary Regions. In the rare event of a database outage affecting an entire Region, the secondary Region can be promoted to full read-write capabilities within minutes to provide DR capabilities.
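Promotion of the secondary Region could look like the sketch below, which uses the boto3 RDS operation failover_global_cluster. The helper name and identifiers are hypothetical. Because replication is asynchronous, we assume an unplanned failover and therefore set AllowDataLoss, which accepts losing writes that had not yet replicated.

```python
def promote_secondary(rds_client, global_cluster_id, dr_cluster_arn):
    """Promote the DR cluster to read-write during an unplanned outage."""
    return rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        # The target is identified by the ARN of the secondary cluster.
        TargetDbClusterIdentifier=dr_cluster_arn,
        # Unplanned failover: unreplicated writes may be lost.
        AllowDataLoss=True,
    )


# Wiring with a real client (requires boto3 and AWS credentials):
#   import boto3
#   rds = boto3.client("rds", region_name="us-west-2")
#   promote_secondary(rds, "wordpress-global",
#                     "arn:aws:rds:us-west-2:111122223333:cluster:wordpress-dr")
```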

Elastic, scalable compute services using Amazon EC2 Auto Scaling Groups

A Pilot Light DR strategy requires that compute resources are not pre-provisioned in the secondary Region to optimize costs. However, they must be rapidly deployable in the event of a failover.

Elastic and scalable compute services with Amazon EC2 Auto Scaling Groups: Amazon EC2 Auto Scaling Groups address this need by enabling the dynamic scaling of compute resources. During a DR event, compute capacity in the recovery Region can be quickly adjusted to handle the shifted workload, maintaining application availability without incurring unnecessary costs during normal operations. Setting the desired capacity of the Amazon EC2 Auto Scaling Group to zero in the recovery Region during normal operations minimizes costs. In the event of a failover, the capacity can be quickly scaled up to accommodate the redirected traffic, providing a smooth transition and continuous service availability.
Simplified image management with EC2 Image Builder: This service streamlines the creation, testing, and deployment of Amazon Machine Images (AMIs), which are essential for automating the setup of EC2 instances. By using EC2 Image Builder, AMIs can be pre-configured to automatically mount the necessary Elastic File System (EFS) in both the primary and recovery Regions. This makes sure that instances are immediately ready to serve the application with the correct configuration and data access paths upon deployment.
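The zero-capacity pattern above can be expressed as a pair of Auto Scaling updates. The following sketch builds the parameters for the boto3 update_auto_scaling_group call; the group name and instance counts are hypothetical. Minimum size is raised together with desired capacity because desired capacity must stay within the group's minimum and maximum.

```python
def capacity_update(asg_name, desired, max_size):
    """Build update_auto_scaling_group parameters for a target capacity."""
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": desired,
        "DesiredCapacity": desired,
        # Never let the ceiling fall below the requested capacity.
        "MaxSize": max(max_size, desired),
    }


# Normal operations: no instances running in the recovery Region.
standby = capacity_update("wordpress-dr-asg", desired=0, max_size=6)
# Failover: bring up enough instances for the redirected traffic.
active = capacity_update("wordpress-dr-asg", desired=4, max_size=6)
print(standby)
print(active)

# Applying it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("autoscaling", region_name="us-west-2") \
#        .update_auto_scaling_group(**active)
```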

Multi-Region data replication using Amazon S3 cross-Region replication

Amazon S3 is used to store static content such as images, JavaScript, and CSS, which are cached and delivered through Amazon CloudFront, a content delivery network (CDN) service built for high performance, security, and developer convenience. Amazon S3 cross-Region replication is used to replicate these objects stored in a bucket in the primary Region to a bucket stored in the secondary Region. During a failover event, the CloudFront distribution has its Origin updated to point to the S3 bucket in the recovery Region, and the Amazon S3 replication is turned off and set up to replicate in the reverse direction.
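As an illustration of the two moving parts above, the sketch below builds a replication configuration for the boto3 put_bucket_replication call and rewrites the origin of a CloudFront distribution configuration. Bucket names, the rule ID, the role ARN, and the origin ID are hypothetical. S3 replication also requires versioning on both buckets, which is assumed to be enabled already.

```python
import copy


def replication_configuration(role_arn, destination_bucket):
    """Replicate every object in the source bucket to the destination."""
    return {
        "Role": role_arn,
        "Rules": [{
            "ID": "wordpress-static-dr",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},  # an empty filter matches the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
        }],
    }


def with_origin_domain(distribution_config, origin_id, new_domain):
    """Return a copy of a DistributionConfig with one origin repointed."""
    updated = copy.deepcopy(distribution_config)
    for origin in updated["Origins"]["Items"]:
        if origin["Id"] == origin_id:
            origin["DomainName"] = new_domain
            return updated
    raise KeyError(f"origin {origin_id!r} not found")


# Applying both (requires boto3 and AWS credentials):
#   s3.put_bucket_replication(Bucket="wp-static-primary",
#       ReplicationConfiguration=replication_configuration(role, "wp-static-dr"))
#   resp = cloudfront.get_distribution_config(Id=dist_id)
#   cfg = with_origin_domain(resp["DistributionConfig"], "static-origin",
#                            "wp-static-dr.s3.us-west-2.amazonaws.com")
#   cloudfront.update_distribution(Id=dist_id, DistributionConfig=cfg,
#                                  IfMatch=resp["ETag"])
```

Reversing the replication direction after failover would reuse replication_configuration with the source and destination buckets swapped.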

DNS Failover with Amazon Route 53

Amazon Route 53 provides highly available and scalable Domain Name System (DNS), domain name registration, and health-checking web services. One key capability we are using is DNS Failover, which consists of two components: health checks and failover. Health checks monitor the workload in the primary Region to determine the health of the WordPress application. In the event of a disruption, Route 53 detects the failure of the primary Region and redirects traffic to the Elastic Load Balancing (ELB) load balancer in the secondary Region.

Employing Route 53 health checks for reliable failover:

Pre-configuration is Key: Given that it may not be possible to modify DNS records during an outage, pre-configuring health checks and failover routing policies becomes essential. This involves setting up secondary records pointing to the resources in the secondary Region well in advance of disruptions. Begin by setting up two DNS records within Route 53 for your domain:
- A primary record pointing to the primary ELB.
- A secondary record pointing to the secondary ELB in the failover Region.
Health Check Configuration: Configure Route 53 health checks to monitor the health of the primary ELB, as well as critical endpoints within the application. These checks should be designed to accurately detect not only binary failures but also degraded performance, or partial outages (grey failures).
Failover Routing Policy: Use the Route 53 failover routing policy to automatically reroute traffic to the secondary ELB in the secondary Region upon detecting a failure or a grey failure scenario. This makes sure that application availability is maintained, minimizing downtime and preserving user experience.
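The primary and secondary records described above could be created ahead of time with a change batch like the one sketched here, which targets the boto3 change_resource_record_sets call. The domain name, load balancer DNS names, hosted zone IDs, and health check ID are placeholders.

```python
def failover_record(name, role, elb_dns_name, elb_zone_id,
                    health_check_id=None):
    """Build an UPSERT change for one alias record in a failover pair."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-elb",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": elb_zone_id,  # hosted zone of the load balancer
            "DNSName": elb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {"Changes": [
    failover_record("www.example.com.", "PRIMARY",
                    "primary-elb.us-east-1.elb.amazonaws.com", "ZPRIMARYEXAMPLE",
                    health_check_id="11111111-2222-3333-4444-555555555555"),
    failover_record("www.example.com.", "SECONDARY",
                    "dr-elb.us-west-2.elb.amazonaws.com", "ZSECONDARYEXAMPLE"),
]}

# Applying it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZMYDOMAINEXAMPLE", ChangeBatch=change_batch)
```

Because both records exist before any outage, failover depends only on health check evaluation and not on changing DNS records during the event.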

Amazon CloudWatch Alarms, AWS Step Functions, and AWS Lambda to automate recovery procedures

Amazon CloudWatch Alarms, AWS Step Functions, and AWS Lambda are services within the AWS ecosystem designed to monitor resources, orchestrate workflows, and run code, respectively. CloudWatch Alarms allow you to watch over the metrics for AWS resources and applications, sending notifications or automatically making changes to the resources based on predefined rules. Step Functions is a serverless orchestration service that enables you to design and execute workflows as state machines. These state machines can coordinate multiple AWS services into serverless workflows, allowing for the automation of complex business processes. Lambda is a serverless computing service that automatically runs code in response to events, such as changes in data or system state, without requiring the management of underlying servers.

A CloudWatch alarm monitors the status of the Route 53 health checks. When the alarm detects a failure in the primary Region, it triggers a Lambda function, which in turn starts a Step Functions state machine to orchestrate the failover to the recovery Region. The state machine scales up resources in the Amazon EC2 Auto Scaling Group, pauses and reverses replication for Amazon EFS and Amazon S3, and notifies stakeholders, thereby automating critical aspects of the DR process.
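A minimal sketch of this chain follows. It builds parameters for the boto3 put_metric_alarm call on the Route 53 HealthCheckStatus metric and defines a Lambda handler that notifies through Amazon SNS and starts the state machine. All ARNs and names are hypothetical, and the event shape read by the handler (alarmData.state.value) is an assumption about the payload a CloudWatch alarm sends when it invokes Lambda directly.

```python
import json


def health_alarm_params(health_check_id, lambda_arn):
    """Alarm when the Route 53 health check reports unhealthy (value 0)."""
    return {
        "AlarmName": "wordpress-primary-unhealthy",
        "Namespace": "AWS/Route53",  # these metrics live in us-east-1
        "MetricName": "HealthCheckStatus",
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [lambda_arn],
    }


def make_handler(sfn_client, sns_client, state_machine_arn, topic_arn):
    """Create the Lambda handler with its AWS clients injected."""
    def handler(event, context):
        state = event.get("alarmData", {}).get("state", {}).get("value")
        if state != "ALARM":
            return {"started": False}  # ignore OK / INSUFFICIENT_DATA
        sns_client.publish(TopicArn=topic_arn,
                           Subject="WordPress DR failover started",
                           Message=json.dumps(event, default=str))
        execution = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps({"reason": "primary health check failed"}))
        return {"started": True, "executionArn": execution["executionArn"]}
    return handler


# In a Lambda module (requires boto3):
#   import boto3, os
#   handler = make_handler(boto3.client("stepfunctions"), boto3.client("sns"),
#                          os.environ["STATE_MACHINE_ARN"], os.environ["TOPIC_ARN"])
```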

Summary of resources in the recovery Region

Here’s a summary of the essential resources in the recovery Region:

Elastic Load Balancing (ELB): Acts as the entry point for traffic in the recovery Region, distributing incoming application requests across available EC2 instances efficiently. This is running during normal operations to make sure that the Route 53 failover policies can automatically redirect traffic without manual intervention.
Amazon EC2 Auto Scaling Group: Maintains application availability by automatically adjusting EC2 instance capacity in response to traffic demands. Initially set with a zero desired capacity to optimize costs during normal operations, it quickly scales up in response to a failover event to handle redirected traffic.
Amazon Machine Images (AMIs): A pre-configured AMI created using EC2 Image Builder is used to launch EC2 instances in the recovery Region. This AMI includes the necessary configurations for the WordPress site, such as the Amazon EFS mount configuration for the recovery Amazon EFS filesystem.
Amazon Elastic File System (EFS): Provides file storage for static content and media files for the WordPress application replicated from the primary Region.
Amazon Aurora Global Database: Facilitates cross-Region data replication for the WordPress MySQL database, providing data consistency and low replication latency between the primary and recovery Regions. This setup allows for rapid promotion of the secondary (read-only) cluster in the recovery Region to a primary database in the event of a failover.
Amazon Route 53: Configured with health checks and failover routing policies, Route 53 monitors the health of the application in the primary Region and automatically redirects traffic to the recovery Region if an outage or degraded performance is detected. This makes sure of continuous application availability without manual intervention.
Amazon S3 and CloudFront: For static content delivery, Amazon S3 stores static assets, which are distributed through the CloudFront CDN. Amazon S3 cross-Region replication makes sure that static content is available and consistent in the recovery Region, while CloudFront delivers content with low latency to end-users globally.

These resources form the backbone of the recovery Region’s infrastructure, enabling businesses to maintain operational continuity and quickly recover from unplanned outages with minimal impact on application performance and user experience.

Failover and failback procedures

Automated recovery steps with Step Functions and CloudWatch Alarms

1. Failure detection and notification:

a. CloudWatch Alarms: Alarms based on Route 53 health checks monitor the primary application endpoint. Upon detecting an issue, the alarm triggers a Lambda function.

b. AWS Lambda Trigger: The Lambda function sends notifications to stakeholders through Amazon Simple Notification Service (SNS) and initiates the failover process in Step Functions.

2. Step Functions Execution: The Lambda function triggered by the CloudWatch alarm starts the execution of the failover state machine, automating the following steps:

a. EFS Replication Management: Stop EFS replication and start reverse replication from the recovery Region back to the primary Region to prepare for failback.

b. S3 Replication Management: Stop Amazon S3 replication and start reverse replication from the recovery Region back to the primary Region to prepare for failback.

c. CloudFront Update: Update the CloudFront distribution settings to point to the recovery S3 bucket.

d. Scaling EC2 Instances: Adjust the desired capacity for the Amazon EC2 Auto Scaling Group in the recovery Region to accommodate the redirected load.
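One way to express the failover steps above is a sequential state machine written in Amazon States Language. The sketch below generates such a definition from an ordered list of steps. The state names and Lambda function ARNs are hypothetical, and a production workflow would likely add retries and error handling to each task.

```python
import json


def sequential_state_machine(steps, comment=""):
    """Chain (state_name, task_resource_arn) pairs into one ASL definition."""
    names = [name for name, _ in steps]
    states = {}
    for index, (name, resource_arn) in enumerate(steps):
        state = {"Type": "Task", "Resource": resource_arn}
        if index + 1 < len(steps):
            state["Next"] = names[index + 1]  # continue with the next step
        else:
            state["End"] = True  # last step terminates the execution
        states[name] = state
    return {"Comment": comment, "StartAt": names[0], "States": states}


ARN = "arn:aws:lambda:us-west-2:111122223333:function:"  # placeholder prefix
FAILOVER_STEPS = [
    ("ReverseEfsReplication", ARN + "reverse-efs-replication"),
    ("ReverseS3Replication", ARN + "reverse-s3-replication"),
    ("PointCloudFrontToRecoveryBucket", ARN + "update-cloudfront-origin"),
    ("ScaleUpRecoveryAsg", ARN + "scale-recovery-asg"),
]

definition = sequential_state_machine(FAILOVER_STEPS, "WordPress DR failover")
print(json.dumps(definition, indent=2))

# Creating it (requires boto3 and AWS credentials):
#   sfn.create_state_machine(name="wordpress-failover", roleArn=role_arn,
#                            definition=json.dumps(definition))
```

The same helper could generate the failback workflow by passing the failback steps in their own order.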

Automated failback steps with Step Functions

1. Trigger: Initiate the failback process by manually starting another Step Functions state machine designed for failback, or trigger it automatically when certain conditions are met (such as restoration of the primary Region’s health).

2. Step Functions Execution: Once the failback state machine is triggered, assuming the primary Region is now healthy, the following steps are automated:

a. EFS Replication Management: Stop EFS replication and resume replication from the primary Region to the recovery Region.

b. S3 Replication Management: Stop Amazon S3 replication and resume replication from the primary Region to the recovery Region.

c. CloudFront Update: Update the CloudFront distribution settings to point to the primary S3 bucket.

d. Scaling EC2 Instances: Wait for the primary Region to become available and healthy once again, and then adjust the desired capacity for the Amazon EC2 Auto Scaling Group in the recovery Region to zero to shut down the EC2 instances used during the failover.

3. Amazon Route 53 Failback: Once the services are restored in the primary Region, Route 53 failover routing automatically updates to point back to the original primary endpoint.
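If failback is triggered automatically, it helps to require that the primary Region has been healthy for some time rather than reacting to a single good check. The sketch below is one possible gate. The function name and the threshold of three consecutive healthy observations are assumptions chosen for illustration.

```python
def primary_is_stable(observations, required_consecutive=3):
    """Return True when the most recent checks were all healthy.

    observations: booleans ordered oldest to newest, True meaning healthy.
    """
    recent = observations[-required_consecutive:]
    return len(recent) == required_consecutive and all(recent)


# The primary recovered, but only two checks ago: keep waiting.
print(primary_is_stable([False, False, True, True]))
# prints False

# Three healthy checks in a row: failback can start.
print(primary_is_stable([False, True, True, True]))
# prints True
```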

Regular testing

It’s essential to regularly test both the failover and failback processes to make sure they work as expected under various scenarios. This testing should be as comprehensive as possible, such as simulating actual disaster scenarios, to make sure that the DR strategy is effective and that the team is prepared for real-world events.

Incorporating these detailed steps into the DR plan helps make the failover and failback process more robust and reliable, reducing the potential for issues and downtime during critical periods.

AWS Storage Blog