AWS Database Blog
How VMware consolidated a multi-tenant cloud asset data store on Amazon Aurora MySQL with Amazon RDS Proxy
This post is co-written with Peter Fein, Staff Engineer 2 at VMware
VMware Tanzu CloudHealth consolidated a multi-tenant, self-managed, 166-node sharded MySQL database farm onto Amazon Aurora MySQL-Compatible Edition with Amazon RDS Proxy. The goal was to support long-term, continuous, multi-factor data growth on their platform while improving reliability and simplifying operations. VMware Tanzu CloudHealth is a leading software as a service (SaaS) cloud cost management platform with over 20,000 customers supporting over $20 billion in annual cloud spend. The multi-tenant SaaS platform was built using AWS services from inception. VMware Tanzu CloudHealth chose Amazon Aurora because of its rich set of features, including RDS Proxy, which was essential to meeting the goals of this migration.
Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.
RDS Proxy is a fully managed, highly available database proxy for Amazon Relational Database Service (Amazon RDS) that makes applications more scalable, more resilient to database failures, and more secure.
In this post, we discuss how VMware Tanzu CloudHealth consolidated and migrated its databases from Amazon Elastic Compute Cloud (Amazon EC2) C5 instance-based shards to Amazon Aurora. We also discuss how the move let VMware place more databases in each Aurora cluster, shrinking the number of database instances they were self-managing and achieving 3:1 schema consolidation. At the end of the project, 62 Aurora clusters (124 Aurora DB instances: one primary DB instance and one read replica per cluster in a Multi-AZ setup) replaced the 166-node sharded MySQL farm (332 self-managed EC2 instances, counting the standby kept for HA on each shard). RDS Proxy was integrated in front of the Aurora clusters to support heavy connection load from a wide range of applications. RDS Proxy connection multiplexing allowed VMware to reduce the number of connections to each Aurora cluster by a factor of 10 compared to the connections that applications opened against the proxy, greatly reducing database load. This was a key driver in consolidating more databases (3:1 schema consolidation), improving application performance, and enhancing the end customer experience. In addition, better performance on Amazon RDS R6g instances powered by AWS Graviton2 processors resulted in a 63% reduction in the required number of vCPUs, which in turn yielded a 20% cost reduction.
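As a quick sanity check, the headline ratios can be recomputed from the counts quoted above. The short Python sketch below uses only those four numbers. The 63% vCPU figure depends on the mix of instance sizes, which the post does not break down, so it is not recomputed here.

```python
# Counts quoted in the post.
ec2_shards = 166
ec2_instances = 332       # primary plus standby for each shard
aurora_clusters = 62
aurora_instances = 124    # one primary plus one read replica per cluster

# Average number of former EC2 shards hosted by each Aurora cluster.
shards_per_cluster = ec2_shards / aurora_clusters

# Fraction of database instances that no longer exist after the migration.
instance_reduction = 1 - aurora_instances / ec2_instances

print(f"Shards per Aurora cluster: {shards_per_cluster:.2f}")    # 2.68
print(f"Reduction in instance count: {instance_reduction:.1%}")  # 62.7%
```

An average of about 2.7 shards per cluster is consistent with the roughly 3:1 consolidation described in the post.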
VMware Tanzu CloudHealth’s asset data store: pre-Aurora architecture
The platform’s previous data store architecture was based on self-managed MySQL databases. These databases stored customer cloud asset data, usage metrics, and operational data. Before the migration, VMware was managing 166 MySQL version 5.6 shards on C5 EC2 instances. Each shard was composed of a primary and a standby replica, but only the primary carried operational load. Sharding is implemented at the customer level, with groups of customers co-located on a single database (MySQL schema). Each schema has a unique name and, before the migration, contained an average of 200 customers. The databases were under continuous high load, and the shards were managed using complex Chef recipes, which meant high management overhead along with performance and scaling issues.
Key drivers to modernize the cloud asset data store
Over the past decade, the customer base has grown significantly, resulting in an increase in the database capacity that needed to be self-managed. This growth also caused database operational, scaling, and resiliency issues. As a result, VMware started looking at options to migrate to cloud-managed databases.
The following key factors drove the push for modernization:
- Limitations of running MySQL on a single EC2 instance, including difficulty provisioning EC2 instances of the right size due to variable load and storage needs
- MySQL binlog replicas were only used for failover, not for read scaling
- Shard expansion required a platform maintenance outage window
- Large customer and partner tenant onboarding required manual intervention to get proper balancing across shards
- Schema changes across 166 shards reduced velocity of code deployments
- Complex Chef recipes to configure databases required specialized knowledge
Why they migrated to Aurora and RDS Proxy
The VMware Tanzu CloudHealth organization had been discussing the need to do a database modernization project for a long time. As a long-time AWS customer, they looked at the RDS family and found Aurora MySQL to be a good fit. They considered several features to match requirements, as detailed in the following figure.
Of these features, the following were important for the migration:
- High performance and push button compute scalability
- Storage auto scaling
- Amazon Aurora Parallel Query for Aurora MySQL
- Multi-AZ deployments with Amazon Aurora Replicas
- Fault-tolerant and self-healing storage
- Amazon RDS Blue/Green Deployments
- Optimized I/O costs
In the following sections, we walk through the phases of VMware’s planning and migration approach.
Phase 1: Compliance and performance testing
VMware used the MySQL slow query log feature to capture a large sample of SQL queries over a 24-hour period. They then built a tool to replay the read queries from the logs and capture response times against both Aurora and EC2 MySQL. The results informed the shard consolidation plan and helped them optimize the top long-running SQL queries, database instance resource (CPU, memory, and I/O) utilization, and overall performance.
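The post does not show VMware's replay tool. The following Python sketch only illustrates the general idea under stated assumptions: it parses slow query log text in the standard MySQL format, keeps the SELECT statements, and times each one through a caller-supplied `execute` function. The names `SlowQuery`, `read_queries`, and `replay`, and the `assets` table in the sample log, are hypothetical.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class SlowQuery:
    query_time: float  # seconds, as recorded by the source server
    sql: str


_QUERY_TIME = re.compile(r"^# Query_time:\s*([\d.]+)")


def read_queries(log_text: str) -> Iterator[SlowQuery]:
    """Yield the SELECT statements found in MySQL slow query log text."""
    query_time = None
    statement: list[str] = []
    for line in log_text.splitlines():
        match = _QUERY_TIME.match(line)
        if match:
            # A new log entry starts; remember its recorded duration.
            query_time = float(match.group(1))
            statement = []
            continue
        if line.startswith("#") or query_time is None:
            continue  # other header lines, or text before the first entry
        stripped = line.strip()
        # Skip the session bookkeeping that precedes the actual statement.
        if not statement and stripped.lower().startswith(("set timestamp", "use ")):
            continue
        statement.append(stripped)
        if stripped.endswith(";"):
            sql = " ".join(statement).rstrip(";")
            if sql.lower().startswith("select"):
                yield SlowQuery(query_time, sql)
            statement = []


def replay(queries: Iterable[SlowQuery], execute: Callable[[str], object]) -> list[float]:
    """Run each query through execute(sql) and return elapsed seconds per query."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        execute(query.sql)
        timings.append(time.perf_counter() - start)
    return timings


SAMPLE_LOG = """\
# Time: 2023-01-01T00:00:01.000000Z
# User@Host: app[app] @ [10.0.0.1]  Id: 42
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 90000
SET timestamp=1672531201;
SELECT id, name
FROM assets WHERE customer_id = 7;
# Time: 2023-01-01T00:00:02.000000Z
# User@Host: app[app] @ [10.0.0.1]  Id: 43
# Query_time: 1.200000  Lock_time: 0.000050 Rows_sent: 0  Rows_examined: 1
SET timestamp=1672531202;
UPDATE assets SET name = 'x' WHERE id = 1;
"""

queries = list(read_queries(SAMPLE_LOG))
# A real run would pass a function that executes the SQL on an open connection;
# a no-op stands in here so the sketch runs without a database.
timings = replay(queries, lambda sql: None)
```

Running `replay` once with a connection to the EC2 MySQL shard and once with a connection to the Aurora cluster would give two timing lists that can be compared query by query.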
The result of this phase was as follows:
- The Aurora MySQL 5.7-compatible version was 20% faster in first-row response time under load, with one primary and one reader instance, compared with MySQL 5.6 on EC2 C5.
- The exact hardware used in the test was a c5.9xlarge with 36 vCPUs compared to a Graviton2 r6g.4xlarge with 16 vCPUs. With less than half the vCPU capacity, Aurora on Graviton2 still bested the EC2 instance.
- No application code (SQL syntax) changes were required to move to the Aurora MySQL 5.7-compatible version. As a result, it was easy to test the application using Aurora MySQL.
At the end of this phase, the VMware team was confident in terms of performance and compatibility to move forward with the Aurora MySQL database migration.
Phase 2: First Aurora shards and internal customers
During this phase, VMware added the first Aurora cluster to the database farm. They then initialized some of their tenants on this shard for a full integration testing period. During this phase, VMware also started building out an operational environment for Aurora MySQL as they moved away from Chef-based Amazon EC2 deployment. This modernization phase enhanced database operations by leveraging other AWS services such as AWS Lambda, Amazon CloudWatch, and AWS Secrets Manager.
AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, API keys, and other secrets throughout their lifecycles.
AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or back-end service without provisioning or managing servers. You can trigger Lambda from over 200 AWS services and software as a service (SaaS) applications, and only pay for what you use.
The VMware team also started to work on Aurora optimal configuration with parameter group settings. Components constructed during this time included:
- A provisioning system using Terraform to simplify the creation of Aurora clusters with size-dependent parameter group settings. The Amazon Route 53 records that provide compatibility with their application DNS are provisioned automatically as well.
- An AWS Lambda function to modernize local data integrity check scripts and local MySQL initialization scripts that were part of Chef recipes.
- A monitoring dashboard and alerting based on Amazon CloudWatch data.
- A database migration process that used scripts to take a logical backup of each Amazon EC2-based MySQL database (with the mydumper and myloader tools) and restore it into an Aurora MySQL cluster. Standard MySQL binlog replication was used to synchronize ongoing write changes to the new Aurora shard before cutover. This process also allowed for consolidation, because multiple MySQL databases were migrated onto a single Aurora cluster. The logical database structure was maintained.
- Adoption of Amazon RDS Performance Insights across the organization to help diagnose database performance issues and make the overall database monitoring process easier.
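The size-dependent parameter group settings mentioned in the list above can be thought of as layers: a shared baseline, values keyed by instance class, and optional per-cluster overrides. VMware built this in Terraform, and the post does not show their configuration. The Python sketch below only illustrates the layering logic. The function name and every value in it are made-up assumptions, although `wait_timeout` and `max_connections` are standard MySQL parameter names.

```python
# Illustrative values only; these are not VMware's settings.
BASE_PARAMETERS = {
    "wait_timeout": "300",
    "max_connections": "2000",
}

# Overrides keyed by DB instance class (hypothetical sizing choices).
PARAMETERS_BY_INSTANCE_CLASS = {
    "db.r6g.4xlarge": {"max_connections": "4000"},
    "db.r6g.8xlarge": {"max_connections": "8000"},
}


def resolve_parameters(instance_class, cluster_overrides=None):
    """Merge baseline, size-specific, and per-cluster settings, in that order."""
    parameters = dict(BASE_PARAMETERS)
    parameters.update(PARAMETERS_BY_INSTANCE_CLASS.get(instance_class, {}))
    parameters.update(cluster_overrides or {})
    return parameters


resolved = resolve_parameters("db.r6g.8xlarge", {"wait_timeout": "600"})
print(resolved)
```

Later layers win, so a per-cluster override always takes precedence over the size-based value, which in turn takes precedence over the baseline.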
Phase 3: Start shard migration
During this phase, the team began migrating customer shards to Aurora. As the migration occurred, the applications were tuned to leverage the Aurora architecture. Accomplishments included:
- Modification of common database access libraries to move read loads to the Aurora cluster reader endpoint. This allowed applications to access the reader endpoint with minimal code changes.
- Integration of RDS Proxy in front of the Aurora clusters to better manage connection load and improve database availability. RDS Proxy also provides support for upsizing, parameter group changes, and minor version upgrades with near zero downtime.
- With RDS Proxy, VMware was able to initiate a consolidation plan that migrated three EC2 shards onto a single Aurora cluster. In this case, RDS Proxy connection multiplexing allows significantly higher client connection counts to each Aurora cluster without impacting performance.
- Normalization of MySQL client connection parameters across the application ecosystem to make better use of RDS Proxy connection pools.
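The post does not show how VMware's database access libraries route reads. A minimal sketch of the idea, assuming a library that sees each statement before it is sent, might look like the following. The `ClusterEndpoints` class, the `choose_endpoint` function, and the host names are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEndpoints:
    writer: str  # cluster (writer) endpoint, or a proxy in front of it
    reader: str  # cluster reader endpoint


# Statement types treated as safe to send to a reader (an assumption of this sketch).
READ_PREFIXES = ("select", "show", "describe", "explain")


def choose_endpoint(sql: str, endpoints: ClusterEndpoints, in_transaction: bool = False) -> str:
    """Return the endpoint a statement should be sent to."""
    statement = sql.lstrip().lower()
    is_read = statement.startswith(READ_PREFIXES)
    # Reads inside a transaction and locking reads stay on the writer.
    if is_read and not in_transaction and "for update" not in statement:
        return endpoints.reader
    return endpoints.writer


endpoints = ClusterEndpoints(
    writer="writer.example.internal",  # placeholder host name
    reader="reader.example.internal",  # placeholder host name
)
print(choose_endpoint("SELECT * FROM assets WHERE id = 1", endpoints))
print(choose_endpoint("UPDATE assets SET name = 'x' WHERE id = 1", endpoints))
```

Prefix matching is deliberately conservative: anything not recognized as a plain read goes to the writer. Reads that must observe a write made a moment earlier may also need to stay on the writer, because replicas can lag slightly behind it.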
Phase 4: Migration final phase and wrap up
The final phase included the following components:
- Conduct a metrics review of all Aurora clusters for the final consolidation plan and proper instance sizing
- Migrate the last remaining shards in the system, including some of the largest at over 1 TB of total storage
- Continue remapping DB read load across applications to Aurora reader endpoints
- Purchase Amazon RDS Reserved Instances to optimize Aurora database instance costs
- Train and educate team members on the new Aurora stack
Benefits of consolidation to Aurora
At the end of the project, the consolidated multi-tenant customer cloud asset database farm on Aurora achieved 3:1 consolidation compared to the standalone MySQL databases. This was accomplished with all schemas maintained and no consolidation of individual tables. All application frameworks across the VMware Tanzu CloudHealth application ecosystem were modified to use Aurora reader endpoints. The new provisioning system built in Terraform allows for easy creation of Aurora clusters, RDS Proxy, RDS parameter groups, and Route 53 entries with a few lines of code.
The following are some key metrics resulting from migration to Aurora MySQL:
- The post-migration Aurora instance count is 124 r6g.4xlarge and r6g.8xlarge Aurora Graviton2 instances that make up 62 two-node clusters (nodes spread across two AZs for high availability). This replaced 166 EC2 shards on 332 C5 instances for a total reduction of 63% in required vCPUs.
- The migration resulted in a 20% cost reduction that continues to increase over time.
- RDS Proxy supports heavy application connection load, including intensive horizontally scaled applications deployed on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. They achieved 10:1 connection compression during load spikes, with flat, stable DB connection counts.
- The Aurora query cache is a big part of the performance gains over MySQL. Across all shards, an average of 40% of all queries are cached.
- Large savings in operations staff time. Database Administration Team members can focus on new strategic projects instead of keeping the previous system up and running. Fewer pages and operational issues boosted team morale and reduced time spent working off hours.
- Minor engine upgrades completed with minimal downtime. The farm is currently running on the latest Aurora version 2 minor release.
- Simplified monitoring with CloudWatch and Performance Insights.
- Responsive support from AWS:
  - Weekly project review meetings with Aurora and RDS Proxy team members during the core migration phase.
  - SQL query optimization advice.
  - Enhancements to RDS Proxy to better support VMware’s use cases with schema consolidation.
Lessons learned
They had the following takeaways:
- Managing large numbers of Aurora clusters requires a strategy for managing parameter groups for both clusters and instances. You need to allow for parameter tuning for different instance sizes and may want per-cluster overrides. The settings that required tuning in their environment were the common MySQL timeout, buffer size, and max connections parameters.
- Take the time to understand all Aurora parameter group options and what they mean at both the cluster and instance level.
- When making application-related changes and configurations, keep RDS Proxy performance in mind. RDS Proxy performs best with consistent connection parameters across clients, which maximizes pooling and reduces pinning.
- It can be complicated and time-consuming to modify existing legacy applications to take advantage of Aurora reader endpoints if the application doesn’t support read/write splitting. Be sure to do proper analysis of applications to ensure they can best use the Aurora cluster architecture.
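One way to act on the RDS Proxy takeaway above is to audit the connection settings each application uses and flag any that differ from the rest. The Python sketch below is a generic illustration, not VMware's tooling. The application names and settings are invented, and the "most common value wins" rule is an assumption made for the example.

```python
from collections import Counter


def find_outliers(app_settings):
    """Return {app: {parameter: value}} for settings that differ from the most common value."""
    parameters = {name for settings in app_settings.values() for name in settings}
    outliers = {}
    for parameter in sorted(parameters):
        # Count how many applications use each value of this parameter.
        counts = Counter(settings.get(parameter) for settings in app_settings.values())
        common_value, _ = counts.most_common(1)[0]
        for app, settings in app_settings.items():
            if settings.get(parameter) != common_value:
                outliers.setdefault(app, {})[parameter] = settings.get(parameter)
    return outliers


# Hypothetical applications and client connection settings.
apps = {
    "billing-api": {"charset": "utf8mb4", "sql_mode": "STRICT_TRANS_TABLES"},
    "report-worker": {"charset": "utf8mb4", "sql_mode": "STRICT_TRANS_TABLES"},
    "legacy-importer": {"charset": "latin1", "sql_mode": "STRICT_TRANS_TABLES"},
}

outliers = find_outliers(apps)
print(outliers)
```

Here only `legacy-importer` is reported, because its `charset` differs from the value the other two applications share. A report like this gives a starting list of clients to normalize before putting them behind a shared proxy.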
Conclusion
In this post, we saw how VMware Tanzu CloudHealth modernized the primary data store for customer cloud asset data to support continued business growth. The modernized database solution on Aurora is highly resilient, scalable, and performant, and resulted in 20% cost savings.
Consider evaluating Aurora for your current database workload requirements to take advantage of innovative features like Global Databases, serverless compute, and cross-AWS Region disaster recovery. Also, review the blog post How to plan for a successful database modernization for options and advice on how to get this complex database technology transformation right.
About the Authors
Peter Fein is Staff Engineer 2 at VMware, Cloud Management Unit. He works on the VMware Tanzu CloudHealth platform, architecting and engineering its data storage layers. He is very passionate about building and operating scalable SaaS applications.
Rajesh Matkar is a Principal Partner Database Specialist Solutions Architect at AWS. He works with AWS Technology and Consulting partners to provide guidance and technical assistance on database projects, helping them improve the value of their solutions.
Sahil Thapar is a Principal Solutions Architect. He works with ISV customers to help them build highly available, scalable, and resilient applications on the AWS Cloud. He is currently focused on containers and machine learning solutions.