AWS Database Blog

Create a fallback migration plan for your self-managed MySQL database to Amazon Aurora MySQL using native bi-directional binary log replication

In the age of agile development and automated cloud deployment, most relational database migrations still follow a rigid, sequential approach. It is common for a single database to support multiple applications or modules owned by multiple teams, and usually all of those applications must move together at the same time. If you’re planning to migrate to Amazon Aurora MySQL-Compatible Edition from a self-managed MySQL environment on premises or on Amazon Elastic Compute Cloud (Amazon EC2), you may be looking for a simpler fallback mechanism or a way to migrate applications incrementally.

In this post, we show you how to set up bi-directional replication between an on-premises MySQL instance and an Aurora MySQL instance. We cover how to configure and set up bi-directional replication and address important operational concepts such as monitoring, troubleshooting, and high availability. In certain use cases, native bi-directional binary log replication can either provide a simpler fallback plan for your migration or provide a way to migrate applications or schemas individually, rather than all at the same time.

Three use cases for bi-directional replication

If your MySQL deployment is for a single application or you have a monolithic workload with many dependencies between components, you can use bi-directional replication for a simplified fallback plan. With database changes flowing in both directions, if something unexpected happens during the migration to Amazon Aurora, you can quickly redirect applications to the original source database.

Additionally, bi-directional replication can facilitate an incremental multi-tenant migration. A multi-tenant MySQL deployment contains a schema or database for each customer or tenant. After bi-directional replication is established between the self-managed database and Aurora, it may be possible to migrate each tenant separately.

Lastly, your MySQL deployment may support multiple applications that use mutually exclusive sets of tables or separate schemas. In this case, each application may be migrated separately.

Considerations for bi-directional replication

Bi-directional replication is based on MySQL native binary log replication. Native replication in MySQL is asynchronous, so there can be replication lag between the source and target. There is also no built-in conflict resolution in native MySQL replication. Therefore, it is not recommended to write data to the same tables on both sides of a bi-directional replication. Doing so may result in errors that stop replication in one or both directions, leaving the two servers with different data. This is commonly referred to as split brain. A basic explanation can be found in Split Brain 101: What You Should Know. In our test scenario in this post, we mitigate insertion collisions by setting the auto_increment_increment and auto_increment_offset variables. Simultaneous deletions or updates can still cause problems that are difficult to reconcile.
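To illustrate how these two variables keep generated keys from colliding, the following is a minimal sketch; the table and values are hypothetical:

-- Self-managed side: the variables can be set dynamically (on Aurora they are
-- set through the cluster parameter group instead).
SET GLOBAL auto_increment_increment = 2;  -- both servers step keys by 2
SET GLOBAL auto_increment_offset = 2;     -- on-premises keys start at 2

CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(10));

-- On premises (offset 2): generated ids are 2, 4, 6, ...
INSERT INTO t (v) VALUES ('a'), ('b');

-- On Aurora (offset 1): generated ids are 1, 3, 5, ...
-- Rows replicated in either direction therefore never collide on the key.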

Troubleshooting MySQL replication is a time-consuming process that may require inspecting and verifying data by hand and making manual data changes such as inserts, updates, or deletions. Manual skipping of transactions in MySQL or manual skipping of transactions in Aurora may also be necessary.
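When a replication error must be skipped, the procedure differs on each side. The following is a hedged sketch for the GTID-based setup used in this post; the GTID value is a placeholder that you would take from the replica's error output:

-- On the self-managed replica: skip one transaction by committing an empty
-- transaction with the offending GTID, then resume.
STOP SLAVE;
SET GTID_NEXT = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1234';
BEGIN; COMMIT;
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;

-- On Aurora MySQL with GTID-based replication, use the built-in procedure:
CALL mysql.rds_skip_transaction_with_gtid('aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1234');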

During the migration process, make sure that all processing for each schema or each application is fully stopped. Verify that the configuration for the application is changed to the new target before restarting the application.

Solution overview

The following diagram illustrates the architecture specific to this post.

Architecture diagram for bi-directional replication between on-premises MySQL and Amazon Aurora MySQL-Compatible Edition

The steps to enabling bi-directional replication are as follows:

  1. Enable binary logs on the on-premises MySQL instance.
  2. Create a custom Aurora parameter group with parameters necessary to enable binary logs on the cluster.
  3. Back up your source database and restore it to Aurora using one of the available migration options. We use Percona Xtrabackup in this example, and we copy the resulting files to an Amazon Simple Storage Service (Amazon S3) bucket. For more details, see Physical migration from MySQL by using Percona Xtrabackup and Amazon S3.
  4. Configure bi-directional replication.
  5. Validate replication status.
  6. Optionally, test bi-directional replication during an Aurora high availability failover event.

Note the Limitations and recommendations for importing backup files from Amazon S3 to Amazon RDS and Replication Compatibility Between MySQL versions. Bi-directional replication is supported only between instances running the same major version.

The screenshots in this post use MySQL 5.7 and Aurora version 2. The same procedure has been tested on MySQL 8.0 and Aurora version 3.

Prerequisites

The sample solution requires an existing database instance, either on premises or on Amazon EC2. It also requires that you have a set of networking resources in place, such as the following:

  • A VPC
  • A database subnet group
  • A database security group with rules that allow MySQL traffic between the two environments
  • Network connectivity between your on-premises environment and AWS, such as an AWS VPN connection

Enable binary logs on the on-premises MySQL instance

To enable binary logs on the on-premises MySQL instance, add the following lines to the /etc/my.cnf configuration file:

server-id=1
log-bin=mysql-bin
binlog_format=ROW
sync_binlog=1
gtid_mode=ON
enforce_gtid_consistency=ON
log_bin_trust_function_creators=ON
auto_increment_increment=2
auto_increment_offset=2

Because log_bin is a static variable, you must restart MySQL to enable the binary log. Note that the parameters set here will match the parameters set in Aurora in the next step, with one exception: the auto_increment_offset will be different, so that auto-incrementing integer values written on the on-premises server will be even numbers, and auto-incrementing integer values written on Aurora will be odd numbers.
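After the restart, you can confirm that the settings took effect with a quick check from the MySQL client:

SHOW VARIABLES WHERE Variable_name IN
  ('log_bin', 'binlog_format', 'gtid_mode', 'enforce_gtid_consistency',
   'auto_increment_increment', 'auto_increment_offset');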

You can set expire_logs_days (MySQL 5.7) or binlog_expire_logs_seconds (MySQL 8.x) to a higher value, if needed. The larger your database and the longer the restore time, the more binary logs you will need in order to successfully set up binary log replication.

For more information, see Binary Logging Options and Parameters.

Create a custom Aurora parameter group

In Aurora, there are two types of parameter groups: cluster level and instance level. Binary log parameters are only available in a cluster-level parameter group. To set custom values for the required parameters, you must create a custom cluster-level parameter group. Refer to Working with DB cluster parameter groups for more details. You use this custom parameter group in a later step when you restore a new Aurora cluster from your backup in Amazon S3.

The parameters to modify are as follows:

binlog_format                      ROW
gtid_mode                          on
enforce_gtid_consistency           on
log_bin_trust_function_creators    1
auto_increment_offset              1
auto_increment_increment           2
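If you prefer to script this step, the following AWS CLI sketch creates and populates the cluster parameter group; the group name is a placeholder, and the family shown matches Aurora MySQL version 2 (use aurora-mysql8.0 for Aurora MySQL version 3):

# Create the cluster-level parameter group.
aws rds create-db-cluster-parameter-group \
    --db-cluster-parameter-group-name binlog-bidirectional \
    --db-parameter-group-family aurora-mysql5.7 \
    --description "Binary log parameters for bi-directional replication"

# Apply the binary log settings.
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name binlog-bidirectional \
    --parameters \
      "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=pending-reboot" \
      "ParameterName=gtid_mode,ParameterValue=ON,ApplyMethod=pending-reboot" \
      "ParameterName=enforce_gtid_consistency,ParameterValue=ON,ApplyMethod=pending-reboot" \
      "ParameterName=log_bin_trust_function_creators,ParameterValue=1,ApplyMethod=pending-reboot" \
      "ParameterName=auto_increment_offset,ParameterValue=1,ApplyMethod=pending-reboot" \
      "ParameterName=auto_increment_increment,ParameterValue=2,ApplyMethod=pending-reboot"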

You can verify the configuration changes on the Amazon RDS console. On the Parameter groups page, select your newly created parameter group and on the Actions menu, choose Compare.

Amazon RDS Parameter Group listing

Compare the new parameter group values against the original defaults to confirm the configuration changes.

Amazon RDS Parameter group value comparison, custom vs default

Back up the source database

There are several ways to back up the source database and restore to an Aurora MySQL cluster. For this post, we take a backup of the source database using the Percona XtraBackup tool. The following code is an example of how to back up a database on premises or on Amazon EC2:

xtrabackup --backup -u mysqluser -ppassword --parallel=4 --target-dir=/opres

The backup file is created in the /opres directory. It’s important to confirm that the backup completed successfully by looking for the completed OK message at the end of the log file.

End of the xtrabackup log file indicating that the backup completed OK
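If you capture the xtrabackup output to a file, you can check the result from the shell as well; the log path here is an assumption:

# xtrabackup writes its progress to stderr; assuming it was redirected
# to /tmp/xtrabackup.log when the backup was taken:
tail -n 1 /tmp/xtrabackup.log    # expect a line ending in: completed OK!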

You can use the AWS Command Line Interface (AWS CLI) or the AWS SDK to transfer the backup contents to Amazon S3. For example, an AWS CLI command can synchronize all the contents of the /opres directory into your bucket:

aws s3 sync /opres s3://your-bucket-name

upload status of files to Amazon S3

Restore the backup to an Aurora database

After a consistent database full backup is stored in your S3 bucket, you can perform the restore.

  1. On the Amazon RDS console, choose Databases in the navigation pane.
  2. Choose Restore from S3.
    RDS console representation of 'restore from s3' button
  3. For S3 bucket, enter the bucket where the backup files are stored.
  4. For Engine type, select Aurora (MySQL Compatible).
  5. For Available versions, choose the version that most closely matches your on-premises database. For this example, we choose Aurora (MySQL 5.7) 2.11.2.
    RDS console representation of restore from S3 dialog. Enter the most appropriate Amazon Aurora version for your workload.
  6. For IAM role, choose an appropriate AWS Identity and Access Management (IAM) role with permissions to do the restore, or create a new role.
    RDS console representation of restore from s3 dialog. Enter IAM role.
  7. For Availability & durability, select Create an Aurora Replica or Reader node in a different AZ (recommended for scaled availability).
  8. Expand the Additional configuration section under Database options.
  9. For DB cluster parameter group, choose the custom parameter group you created in the previous section of this post.
  10. Complete the other options to finish the restore configuration, then choose Create database. When the database is ready, its status will show as Available.
    RDS console representation, Available status shown for instance in database listing
  11. Select the new cluster and verify the parameter group on the Configuration tab.
    Configuration tab in RDS console, reflecting correct parameter group.
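If you prefer to script the restore, the AWS CLI exposes the same operation. The following is a hedged sketch; the identifiers, credentials, source engine version, and role ARN are placeholders for your own values:

aws rds restore-db-cluster-from-s3 \
    --db-cluster-identifier cl-mysqlop \
    --engine aurora-mysql \
    --master-username admin \
    --master-user-password 'YourPassword' \
    --s3-bucket-name your-bucket-name \
    --source-engine mysql \
    --source-engine-version 5.7.40 \
    --s3-ingestion-role-arn arn:aws:iam::123456789012:role/rds-s3-restore \
    --db-cluster-parameter-group-name binlog-bidirectional

# The restored cluster needs at least one DB instance; add the writer.
aws rds create-db-instance \
    --db-instance-identifier cl-mysqlop-instance-1 \
    --db-cluster-identifier cl-mysqlop \
    --db-instance-class db.r5.large \
    --engine aurora-mysql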

Configure bi-directional replication

At this point, both the on-premises MySQL instance and the Aurora MySQL instance are ready for replication to be enabled. Connect to both instances with a privileged user and create the replication user accounts. You can get the cluster writer endpoint from the Amazon RDS console on the Connectivity & security tab.

Aurora cluster read-write endpoint listed in Connectivity & Security section under the cluster.

  1. Connect to your on-premises MySQL instance.
  2. Run the following commands to create the replication user:
    CREATE USER IF NOT EXISTS 'replicator'@'%' IDENTIFIED WITH mysql_native_password BY 'Rep#12345';
    GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'replicator'@'%';
    FLUSH PRIVILEGES;
  3. Configure the Aurora MySQL instance as a replica of the on-premises MySQL database. When this is complete and replication is started, the user you created will be replicated to Amazon Aurora MySQL, along with any other changes. On Amazon Aurora MySQL, call the mysql.rds_set_external_master_with_auto_position stored procedure:
    CALL mysql.rds_set_external_master_with_auto_position('172.31.xxx.xxx', 3306,
    'replicator', 'Rep#12345', 0);
  4. Start the Aurora replication with the command CALL mysql.rds_start_replication; against the Aurora MySQL database. Consider increasing the retention of the binary log files in Aurora to at least 2 days with the command CALL mysql.rds_set_configuration('binlog retention hours', 48);. You can verify the setting as shown after this list.
  5. Configure the on-premises MySQL instance as a replica of the Aurora MySQL instance:
    CHANGE MASTER TO
    MASTER_HOST = 'cl-mysqlop.cluster-XXXXXXXXXX.us-east-1.rds.amazonaws.com',
    MASTER_PORT = 3306,
    MASTER_USER = 'replicator',
    MASTER_PASSWORD = 'Rep#12345',
    MASTER_AUTO_POSITION=1;
  6. Finally, run START SLAVE; to start replication.
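To confirm the binary log retention value you set on Aurora, you can call the built-in configuration procedure, which lists RDS-level settings such as binlog retention hours:

-- Run against the Aurora MySQL instance.
CALL mysql.rds_show_configuration;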

You have now established replication in both directions. We check the replication status in the next section.

Refer to Migrating data from an external MySQL database to an Amazon Aurora MySQL DB cluster for additional specifics.

Validate replication status

To make sure that replication is configured correctly, run the command show slave status\G to confirm that each MySQL database has the correct Master_Host information and that Slave_IO_Running and Slave_SQL_Running both show Yes. Make sure that there are no errors listed in this output by examining the Last_Errno and Last_Error fields. Run the command repeatedly and monitor Seconds_Behind_Master to verify that this value approaches 0 over time. The following is an example terminal output:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: xx.xx.xx.xx
                  Master_User: rdsrepladmin
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin-changelog.000001
          Read_Master_Log_Pos: 154
               Relay_Log_File: relaylog.000003
                Relay_Log_Pos: 330
        Relay_Master_Log_File: mysql-bin-changelog.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: mysql.rds_replication_status,mysql.rds_monitor,mysql.rds_sysinfo,mysql.rds_configuration,mysql.rds_history
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 154
              Relay_Log_Space: 530
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2144297276
                  Master_UUID: 9bfd6532-ea72-3141-b8c1-b2217ac70769
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql>

Running the show slave status\G command against the Aurora MySQL instance should output something similar to the preceding code.

You can also validate replication lag by using the Amazon CloudWatch metric AuroraBinlogReplicaLag. If replication is not running, the metric reports no data points. If replication is running behind, the metric shows a non-zero number of seconds. If replication is caught up, the graph displays 0 seconds.

AuroraBinlogReplicaLag cloudwatch metric graph
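You can retrieve the same metric from the AWS CLI. The following is a hedged sketch; the writer instance identifier is a placeholder, and the timestamp arithmetic assumes GNU date:

# Average replica lag over the last 10 minutes, in 60-second periods.
aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name AuroraBinlogReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=cl-mysqlop-instance-1 \
    --statistics Average \
    --period 60 \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"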

Test bi-directional replication during an Aurora high availability failover event

High availability for Aurora DB instances provides a failover mechanism when the primary instance encounters a problem. An Aurora reader instance takes over as the primary instance. To test high availability in an AWS Region, you can manually invoke a failover event via the Amazon RDS console to promote the read replica. In this test, we invoke the failover while inserting 1,000 rows every 2 seconds.

For this example, we create a persona table in both databases and start the test with 537,384 rows in each table.

Console script representation of counts in the 'persona' table
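The exact test scripts are not included in this post; the following bash sketch shows the general shape of the terminal 1 insert loop. The hostnames, credentials, database name, and persona table definition are assumptions, and the derived-table row generator is simply a convenient way to produce 1,000 rows per statement:

#!/bin/bash
# Wrappers for each endpoint (placeholders, not the values used in this post).
onprem() { mysql -h 172.31.0.10 -u admin -pYourPassword testdb -e "$1"; }
aurora() { mysql -h cl-mysqlop.cluster-XXXX.us-east-1.rds.amazonaws.com \
                 -u admin -pYourPassword testdb -e "$1"; }

# Insert 1,000 rows per statement; auto_increment assigns even ids on premises
# and odd ids on Aurora, so the writes never collide.
SQL="INSERT INTO persona (name)
     SELECT CONCAT('row-', n) FROM
       (SELECT @r := @r + 1 AS n
          FROM information_schema.columns a, information_schema.columns b,
               (SELECT @r := 0) init
         LIMIT 1000) gen;"

while true; do
  echo 'Inserting 1000 rows from mysql on-premises'; onprem "$SQL"
  echo 'Inserting 1000 rows from AWS AURORA';        aurora "$SQL"
  sleep 2
done

The terminal 2 counter follows the same pattern, running SELECT NOW(), COUNT(*) FROM persona; against each endpoint in a loop.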

In the following example, we start the script to insert data into both databases (terminal 1):

Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA

Terminal 2 shows us that bi-directional replication is working by incrementing the row counts on both sides:

Counting rows in table 'persona' on-premises
2023-08-17 02:24:29	537384
Counting rows in table 'persona' on AURORA
2023-08-17 02:24:29	537384
Counting rows in table 'persona' on-premises
2023-08-17 02:24:34	539384
Counting rows in table 'persona' on AURORA
2023-08-17 02:24:34	539384
Counting rows in table 'persona' on-premises
2023-08-17 02:24:36	541384
Counting rows in table 'persona' on AURORA
2023-08-17 02:24:36	541384

After we confirm that bi-directional replication is working as expected, we can invoke a failover from the Amazon RDS console. Select the writer instance and on the Actions menu, choose Failover.
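You can also invoke the failover from the AWS CLI; the cluster identifier below is a placeholder:

aws rds failover-db-cluster --db-cluster-identifier cl-mysqlop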

Seconds later, the Aurora inserts fail as expected (terminal 1), but the inserts into the on-premises MySQL database continue without error:

Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS aurora
ERROR 2003 (HY000): Can't connect to MySQL server on 'cl-mysqlop-cluster-xxxxxx.us-east-1.rds.amazonaws.com' (111)
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS aurora
ERROR 2003 (HY000): Can't connect to MySQL server on 'cl-mysqlop-cluster-xxxxxx.us-east-1.rds.amazonaws.com' (111)
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS aurora
ERROR 2003 (HY000): Can't connect to MySQL server on 'cl-mysqlop-cluster-xxxxxx.us-east-1.rds.amazonaws.com' (111)
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS aurora
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS aurora
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS aurora

In terminal 2, we can see that while the failover is taking place, our count queries are also failing with connection errors:

Counting rows in table 'persona' on AURORA now()	count(*)
2023-08-17 02:25:24	571384
Counting rows in table 'persona' on-premises 	now()	count(*)
2023-08-17 02:25:25	571384
Counting rows in table 'persona' on AURORA
ERROR 2003 (HY000): Can't connect to MySQL Server on 'cl-mysqlop-cluster-xxxxxx.us-east-1.rds.amazonaws.com' (111)
Counting rows in table 'persona' on-premises    now()   count(*)
2023-08-17 02:25:27	572384
Counting rows in table 'persona' on AURORA
ERROR 2003 (HY000): Can't connect to MySQL Server on 'cl-mysqlop-cluster-xxxxxx.us-east-1.rds.amazonaws.com' (111)
Counting rows in table 'persona' on-premises    now()   count(*)
2023-08-17 02:25:29     573384
Counting rows in table 'persona' on AURORA
ERROR 2003 (HY000): Can't connect to MySQL Server on 'cl-mysqlop-cluster-xxxxxx.us-east-1.rds.amazonaws.com' (111)

On the Amazon RDS console, you can validate the failover and see that the instance cl-mysqlop-instance-2 is now the writer instance.

Verification that cl-mysqlop-instance-2 is now the writer instance

The failover should occur in seconds. If we monitor the script, we can see that the errors clear up quickly. Shortly thereafter, bi-directional replication synchronizes any changes, and both tables have matching row counts of 605,384 rows.

The following is the view in terminal 1:

Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA
Inserting 1000 rows from mysql on-premises
Inserting 1000 rows from AWS AURORA

The following is the view in terminal 2:

Counting rows in table 'persona' on-premises
2023-08-17 02:26:19	588384
Counting rows in table 'persona' on AURORA
2023-08-17 02:26:19	599384
Counting rows in table 'persona' on-premises
2023-08-17 02:26:21	601384
Counting rows in table 'persona' on AURORA
2023-08-17 02:26:22	589384
Counting rows in table 'persona' on-premises
2023-08-17 02:26:24	603384
Counting rows in table 'persona' on AURORA
2023-08-17 02:26:25	590384
Counting rows in table 'persona' on-premises
2023-08-17 02:26:27	605384
Counting rows in table 'persona' on AURORA
2023-08-17 02:26:27	605384

With this test, we can confirm that the Aurora cluster is able to maintain synchronization with the on-premises MySQL instance after a failover event without any manual intervention. Failover and binary log synchronization times can vary depending on the write activity and the number of unapplied records in the binary logs.

Clean up

To avoid incurring future charges, delete any resources that you created as part of this post, such as the following:

  • VPC
  • Database subnet group
  • Database security group or security group rules
  • AWS VPN connection
  • Aurora cluster
  • S3 bucket
  • EC2 instance
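If you created the database resources solely for this walkthrough, you can also remove them from the AWS CLI. The following is a hedged sketch; the identifiers are placeholders, and --skip-final-snapshot permanently discards the cluster data:

# Delete the reader and writer instances, then the cluster, then the bucket.
aws rds delete-db-instance --db-instance-identifier cl-mysqlop-instance-1
aws rds delete-db-instance --db-instance-identifier cl-mysqlop-instance-2
aws rds delete-db-cluster --db-cluster-identifier cl-mysqlop --skip-final-snapshot
aws s3 rb s3://your-bucket-name --force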

Conclusion

In this post, we demonstrated how to configure bi-directional replication from an on-premises MySQL or EC2 instance to an Aurora MySQL instance. We showed that bi-directional replication self-heals and continues successfully during a high availability failover within the same Region, without the need for manual intervention.

Bi-directional replication is one solution that may provide organizations with flexibility and reduced risk when migrating to Amazon Aurora MySQL. We encourage you to refer to Working with Amazon Aurora MySQL and learn about the other capabilities of Amazon Aurora MySQL that can help transform your organization.


About the Authors

Elkin González is a Sr. Database Specialist Solutions Architect at AWS located in Bogotá, Colombia. He has expertise in RDBMS and cloud technologies, with over 15 years of experience working for high-tech companies such as IBM and Thales. He is currently helping partners and organizations in Latin America design, modernize, and adopt AWS database cloud services. In his spare time, he loves spending time with family, friends, and his dog.

John Scott is a Sr. Database Specialist Solutions Architect based in Atlanta. He has years of experience as a Linux Admin and DBA for companies of all sizes. John’s favorite things are family, great barbecue, and Amazon Aurora.