AWS Database Blog

Performance testing MySQL migration environments using query playback and traffic mirroring – Part 2

This is the second post in a series where we dive deep into performance testing MySQL environments being migrated from on premises. In Part 1, we compared the query playback and traffic mirroring approaches at a high level. In this post, we dive into the setup and configuration of query playback. In Part 3, we discuss the setup and configuration of traffic mirroring.

Solution overview

For our first solution, we discuss query playback using the Percona pt-upgrade tool. pt-upgrade relies on capturing a time frame of your traffic, such as a busy weekend or a peak traffic event, and then replaying that recorded traffic against another environment at a time of your team's choosing. If your application workload is highly variable, you might want to collect slow query log output from different parts of the day to capture a larger variety of queries.

This solution provides more control over when you run tests, enabling teams to run them at convenient times. Because you can replay your captured traffic multiple times, the solution also allows for consistent, repeatable load tests, which supports iterative optimization of your migration environment.

The following diagram illustrates the query playback architecture using the Percona pt-upgrade tool.

Query playback with the Percona pt-upgrade tool

Prerequisites

Complete the following prerequisites:

  1. Create a backup of your on-premises database using Percona XtraBackup. Restore the backup locally to create a local test environment.
  2. Copy the backup to your Amazon Simple Storage Service (Amazon S3) bucket.
  3. Restore the backup to create your Aurora MySQL-Compatible database.
    1. Make sure the RDS primary user that is created as part of the restore process doesn’t conflict with any users that are currently stored in the database.
    2. In the case of testing a newer Aurora MySQL version, restore the backup to create your Aurora MySQL database and perform an in-place upgrade from Aurora MySQL 2 (5.7 compatible) to Aurora MySQL 3 (8.0 compatible).
    3. When the restores are complete, take a DB cluster snapshot of the newly created environments for easier re-creation (see the AWS CLI sketch after this list).
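
The restore from Amazon S3 and the follow-up DB cluster snapshot (steps 3 and 3.3 above) can be performed in the console or with the AWS CLI. The following is a minimal sketch using the AWS CLI; the cluster identifier, instance class, bucket name, IAM role ARN, and version placeholders are assumptions to adapt to your environment:

# Restore the Percona XtraBackup files in Amazon S3 into a new Aurora MySQL cluster
aws rds restore-db-cluster-from-s3 \
    --db-cluster-identifier aurorarestore \
    --engine aurora-mysql \
    --engine-version <aurora-mysql-2-version> \
    --master-username <primary-username> \
    --master-user-password '<password>' \
    --source-engine mysql \
    --source-engine-version 5.7.44 \
    --s3-bucket-name <backup-bucket> \
    --s3-ingestion-role-arn arn:aws:iam::<account-id>:role/<s3-access-role>

# Add a DB instance to the restored cluster so it can serve queries
aws rds create-db-instance \
    --db-instance-identifier aurorarestore-instance-1 \
    --db-cluster-identifier aurorarestore \
    --engine aurora-mysql \
    --db-instance-class <instance-class>

# When the restore is complete, take a DB cluster snapshot for easier re-creation
aws rds create-db-cluster-snapshot \
    --db-cluster-identifier aurorarestore \
    --db-cluster-snapshot-identifier aurorarestore-baseline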

Implement query playback

To set up the solution, complete the following steps (as illustrated in the solution diagram):

Steps 1-3 are covered in the prerequisites section.

  1. Enable slow logs on the on-premises production MySQL database.
    1. Set an appropriate long_query_time. A value of 0 ensures all queries are captured, but can add significant overhead to the production database (see the example configuration after this list).
    2. Copy the slow query logs to an Amazon Elastic Compute Cloud (Amazon EC2) instance with appropriate permissions and network access to the Aurora MySQL database and the test database instance at your on-premises location.
  2. Install pt-upgrade on the EC2 instance (see the installation sketch after this list) and run pt-upgrade against both the on-premises testing environment and the RDS instance you want to test, which for this post is Aurora MySQL 2 (5.7 compatible).
    1. To make sure the MySQL buffer pool is sufficiently pre-warmed, we recommend running the pt-upgrade read-only test at least twice.
    2. To reduce variance in the test, the production and test MySQL instances (hosted on Amazon EC2 for this post) were configured with MySQL configuration parameters similar to those of an Amazon RDS for MySQL 5.7 instance.
  3. Now that you have the environment fully configured, you can rerun the test to validate performance characteristics after engine-level parameter tuning.
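
The following is a minimal sketch of step 1, enabling slow query logging dynamically on the source database; the long_query_time value and log file path are assumptions to adjust for your workload:

-- Run on the on-premises production MySQL database (global settings apply to new
-- connections; existing sessions keep their current long_query_time)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0;   -- capture all queries; increases logging overhead
SET GLOBAL log_output = 'FILE';
SET GLOBAL slow_query_log_file = '/var/lib/mysql/pt_upgrade-slow.log';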
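
For step 2, pt-upgrade (and pt-query-digest, used later) ships as part of the Percona Toolkit. The following installation sketch assumes an Amazon Linux EC2 instance and the Percona yum repository; the exact repository setup can vary by operating system, so check the Percona documentation for your distribution:

# Set up the Percona yum repository and install the toolkit
sudo yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
sudo percona-release enable tools release
sudo yum install -y percona-toolkit

# Verify that pt-upgrade is available
pt-upgrade --version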

In the following sections, we walk through the commands and the corresponding outputs we used to test the solution.

Scenario 1: Replaying the full slow query log

For this post, we ran a 10-minute sysbench oltp_read_write.lua test with 16 threads, 10 tables, and a table size of 100,000 with no prepared statements. The following code shows the size of the slow log that was generated:

[ec2-user@ip-10-0-0-18 ~]$ ls -lah | grep slow
 -rw-r----- 1 ec2-user ec2-user 2.0G Jul  8 19:20 pt_upgrade-slow.log 
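
For reference, the workload described above can be generated with a sysbench invocation along the following lines; the host, user, password, and schema name are placeholders, and the exact options may differ from the run used for this post:

# Prepare 10 tables with 100,000 rows each
sysbench oltp_read_write --mysql-host=<on-prem-host> --mysql-user=<username> --mysql-password=<password> --mysql-db=sbtest --tables=10 --table-size=100000 prepare

# Run for 10 minutes with 16 threads and prepared statements disabled
sysbench oltp_read_write --mysql-host=<on-prem-host> --mysql-user=<username> --mysql-password=<password> --mysql-db=sbtest --tables=10 --table-size=100000 --threads=16 --time=600 --db-ps-mode=disable run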

To run pt-upgrade in its default read-only mode against the full slow query log, use the following command:

pt-upgrade h=<endpoint1> h=<endpoint2> -u <username> -p<password> --type slowlog <name_of_slow_log>.log 1><report_name>.out 2><error_log_name>.out &

Monitor its progress:

[ec2-user@ip-10-0-0-18 ~]$ tail -f aurora-prewarmed-err.out
/home/ec2-user/pt_upgrade-slow.log:  98% 04:35 remain
/home/ec2-user/pt_upgrade-slow.log:  98% 04:04 remain
/home/ec2-user/pt_upgrade-slow.log:  98% 03:33 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 03:02 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 02:33 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 02:03 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 01:33 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 01:03 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 00:33 remain
/home/ec2-user/pt_upgrade-slow.log:  99% 00:03 remain

The following shows snippets of the output report:

#-----------------------------------------------------------------------
# Hosts
#-----------------------------------------------------------------------

host1:

  DSN:       h=10.0.3.9
  hostname:  ip-10-0-3-9.eu-west-2.compute.internal
  MySQL:     MySQL Community Server (GPL) 5.7.44

host2:

  DSN:       h=aurorarestore.cluster-c5h2nrbsz4ea.eu-west-2.rds.amazonaws.com
  hostname:  ip-172-16-0-131
  MySQL:     MySQL Community Server (GPL) 5.7.44

########################################################################
# Query class 558CAEF5F387E929
########################################################################

Reporting class because there are 3 query diffs.

Total queries      3
Unique queries     3
Discarded queries  0

select c from sbtest? where id=?

##
## Query time diffs: 3
##

-- 1.

0.000188 vs. 0.004073 seconds (21.7x increase)

SELECT c FROM sbtest9 WHERE id=37974

-- 2.

0.000147 vs. 0.002020 seconds (13.7x increase)

SELECT c FROM sbtest6 WHERE id=50306

-- 3.

0.000135 vs. 0.001403 seconds (10.4x increase)

SELECT c FROM sbtest3 WHERE id=51957

#-----------------------------------------------------------------------
# Stats
#-----------------------------------------------------------------------

failed_queries        0
not_query             11
not_select            2572593
queries_filtered      0
queries_no_diffs      5923221
queries_read          8575328
queries_with_diffs    79503
queries_with_errors   0

To replay the full slow query log in read/write mode, run the following command with the --no-read-only option:

pt-upgrade --no-read-only h=<endpoint1> h=<endpoint2> -u <username> -p<password> <name_of_slow_log>.log 1><report_name>.out 2><error_log_name>.out &

Performance considerations

In the preceding output comparing the latency of MySQL 5.7 with Aurora MySQL 2, we can see that there are some instances where Aurora MySQL-Compatible has higher latency than community MySQL. Although this is good to know during testing, it is a less-than-ideal scenario for Aurora, because pt-upgrade replays the workload over a single database connection and doesn't generate the kind of concurrent workload where Aurora thrives.

From the preceding (read-only) output, we can see that 5,923,221 queries showed no differences in performance or results, whereas 79,503 queries showed differences. In this test, no queries were reported with errors. If you encounter queries with different outputs or queries with errors, you must investigate the full output report (as shown earlier). pt-upgrade won't show you the cause of a query being slower or why it produced a different result; you need to follow standard engine-level troubleshooting.
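
For example, a reasonable first step is to compare the execution plans for one of the flagged queries on both hosts:

-- Run against both the on-premises MySQL instance and the Aurora MySQL cluster,
-- then compare the access type, chosen index, and estimated rows
EXPLAIN SELECT c FROM sbtest9 WHERE id=37974;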

Scenario 2: Replaying a subset of a slow query log

Instead of replaying the entire slow query log, you can parse the slow query log using pt-query-digest and replay a subset of the queries to test with.

To generate a reduced slow query log, parse the full slow query log with pt-query-digest, or alternatively use the --max-class-size flag with pt-upgrade to limit how many queries in each query class are replayed (see the sketch after the following output):

pt-query-digest --sample 10 --no-report --output slowlog pt_upgrade-slow.log > parsed-log.log
pt_upgrade-slow.log:   4% 10:55 remain
pt_upgrade-slow.log:   9% 10:00 remain

[ec2-user@ip-10-0-0-18 ~]$ ls -lah | grep parsed-log.log
-rw-rw-r-- 1 ec2-user ec2-user  29K Jun 7 09:51 parsed-log.log
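
Alternatively, you can skip the separate pt-query-digest pass and let pt-upgrade limit how many queries it replays per query class with --max-class-size. The following is a sketch; the value of 10 is an arbitrary example:

pt-upgrade h=<endpoint1> h=<endpoint2> -u <username> -p<password> --max-class-size 10 --type slowlog pt_upgrade-slow.log 1><report_name>.out 2><error_log_name>.out &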

To run pt-upgrade with the parsed slow query log, use the following read-only command:

pt-upgrade h=<endpoint1> h=<endpoint2> -u <username> -p"<password>" output.out 1><report_name>.out 2><error_log_name>.out

The following shows snippets of the output report:

#-----------------------------------------------------------------------
# Logs
#-----------------------------------------------------------------------

File: output.out
Size: 271204

#-----------------------------------------------------------------------
# Hosts
#-----------------------------------------------------------------------

host1:

  DSN:       h=10.0.3.9
  hostname:  ip-10-0-3-9.eu-west-2.compute.internal
  MySQL:     MySQL Community Server (GPL) 5.7.44

host2:

  DSN:       h=aurorarestore.cluster-c5h2nrbsz4ea.eu-west-2.rds.amazonaws.com
  hostname:  ip-172-16-0-131
  MySQL:     MySQL Community Server (GPL) 5.7.44

########################################################################
# Query class 84D1DEE77FA8D4C3
########################################################################

Total queries      1
Unique queries     1
Discarded queries  0

select c from sbtest? where id between ? and ? order by c

##
## Query time diffs: 1
##

-- 1.

0.000297 vs. 0.005713 seconds (19.2x increase)

SELECT c FROM sbtest5 WHERE id BETWEEN 50287 AND 50386 ORDER BY c

#-----------------------------------------------------------------------
# Stats
#-----------------------------------------------------------------------

failed_queries        0
not_query             11
not_select            600
queries_filtered      0
queries_no_diffs      499
queries_read          1111
queries_with_diffs    1
queries_with_errors   0

As shown in the preceding output, the number of queries read from the parsed log (1,111) was much lower than from the full slow log (8,575,328), and consequently the test took much less time to run.

Considerations

pt-upgrade supports a number of different input formats from which it can parse and replay queries: the slow query log, general log, binary log, and tcpdump output. For this post, we focus on the slow query log, but you can use any of the other formats.
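
For example, to replay queries from a general query log instead of a slow query log, change the --type option. The following is a sketch; refer to the pt-upgrade documentation for the exact type values supported by your toolkit version:

pt-upgrade h=<endpoint1> h=<endpoint2> -u <username> -p<password> --type genlog <name_of_general_log>.log 1><report_name>.out 2><error_log_name>.out &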

Additionally, you can combine the use of pt-upgrade with another Percona tool, pt-query-digest, to pull a smaller number of samples for each query fingerprint. This might require additional compute resources on the EC2 instance that is created for testing, so you might want to experiment with different instance sizing.

To get consistent results for workloads that aren't read only, you must have a consistent starting point for your database before you begin a replay test. There are a number of different tools that you can use to create the test environment: mysqldump, MySQL Shell, mydumper/myloader, and Percona XtraBackup. In this post, we use Percona XtraBackup, but you are free to use any of the previously mentioned options to seed the database. To reduce the possibility of encountering permission-related errors during the migration, review your current database users and their associated permissions to make sure they are supported on Amazon RDS.
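
One way to get back to a consistent starting point between read/write replays is to restore the DB cluster snapshot taken in the prerequisites. The following AWS CLI sketch uses placeholder identifiers:

# Restore a fresh cluster from the baseline snapshot
aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier aurorarestore-replay \
    --snapshot-identifier aurorarestore-baseline \
    --engine aurora-mysql

# Add a DB instance to the restored cluster before running the next replay
aws rds create-db-instance \
    --db-instance-identifier aurorarestore-replay-instance-1 \
    --db-cluster-identifier aurorarestore-replay \
    --engine aurora-mysql \
    --db-instance-class <instance-class>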

In this post, we have primarily focused on migrations from on premises to Amazon RDS for MySQL or Amazon Aurora MySQL. However, you can also use pt-upgrade to test a migration from Amazon RDS for MySQL to Aurora MySQL, or to test the performance of different database engine versions, such as upgrading from Aurora MySQL 2 to Aurora MySQL 3.

Conclusion

You can use pt-upgrade to test the query performance of compatible database engines, as well as different major and minor versions. Although the tool can help you find queries that perform poorly on a different database engine or version, it has limitations in terms of query concurrency. Therefore, you may still experience locking or concurrency issues post-migration if pt-upgrade is the only testing performed before the migration is completed.

Additionally, although pt-upgrade can help you find poorly performing queries, you need to use standard performance troubleshooting to investigate the cause of performance issues. In this post, we showed how to use pt-upgrade to test performance on Aurora MySQL-Compatible, but the same methodology also applies to Amazon RDS for MySQL upgrades and migrations.

To learn more about load testing using production traffic with a different approach of mirroring traffic, see Part 3 of this series.


About the Authors

Arnab Ghosh is a Sr. Solutions Architect for AWS in North America, helping enterprise customers build resilient and cost-efficient architectures. He has over 15 years of experience in architecting, designing, and developing enterprise applications solving complex business problems.

Patrick Gryczka is a Sr. Solutions Architect specializing in serverless technologies and the sports industry. Beyond the cloud, Patrick fills his time and space with cats, science fiction, Python, and Rust.

Simon Stewart is a Database Specialist Solutions Architect at AWS, focusing on MySQL and Amazon Aurora MySQL. Simon helps AWS customers with their architecture design and provides efficient solutions to run their database workloads on AWS. When not helping customers with their databases on AWS, he enjoys tinkering with his homelab.