Grouping database tables in AWS DMS tasks for Oracle source engine
AWS Database Migration Service (AWS DMS) is a cloud service designed to simplify the process of migrating and replicating databases, data warehouses, and other data stores. It offers a comprehensive solution for both homogeneous and heterogeneous database migrations, facilitating transitions between different database platforms.
The migration process typically involves two major phases:
- Migration of existing data (full load) – During this phase, AWS DMS performs a full load migration, where all existing data from the source data store is transferred to the target data store.
- Ongoing replication (change data capture) – After the initial full load migration, AWS DMS enables ongoing replication of data changes through a process called change data capture (CDC). The changes are continuously replicated and applied to the target data store in real time or near real time, making sure the target data store remains synchronized with the source.
AWS DMS supports a wide range of source and target data stores, including relational databases, data warehouses, and NoSQL databases. Although AWS DMS provides powerful tools for database migration, every migration project is unique and poses distinct challenges, so there is no one-size-fits-all solution. For a successful migration, it's crucial to address the database and application prerequisites, and initial preparation and design play a pivotal role, particularly when it comes to performance tuning and addressing potential latency issues.
In this post, we provide guidelines on identifying potential root causes of full load and CDC latency early on, and offer recommendations for optimally grouping tables to achieve the best performance for the AWS DMS task. By following the approach outlined in this post, you can more effectively plan your database migration, estimate task sizes, and design an efficient migration process that minimizes potential issues and optimizes performance.
Know your data
Working on a migration project requires you to first understand what data you're migrating. The full load migration phase involves transferring a complete copy of the source data to the target database. This process makes sure the target database is populated with an identical version of the data from the source database as it existed when the migration started.
CDC is a technique that captures incremental changes to data and data structures (schemas) from the source database in real time or near real time. CDC enables the propagation of these changes to other databases or applications, making sure they remain synchronized with the source database.
You can assess the full load and CDC requirements for a database migration by analyzing the database size, workload, and hardware information, and use that analysis to determine the number of tasks required and the grouping of tables within each task.
Various factors can directly and indirectly influence the speed of data migration while using AWS DMS. The following are some of the common factors for full load as well as CDC:
- Database object size – Keeping huge tables (larger than 2 TB) in a dedicated migration task can help improve migration efficiency. By isolating the handling of extensive data sets within a specific task or operation, the migration process can potentially become more streamlined and effective.
- Partitioned and non-partitioned objects – You can migrate large partitioned tables by loading multiple tables in parallel. AWS DMS also allows you to load a single large table using multiple parallel threads. This is particularly useful for tables with billions of records spread across multiple partitions and subpartitions.
- Objects without a primary key and unique index – To migrate large objects (LOBs), AWS DMS requires the source table to have a primary key or unique index.
- LOBs – LOB columns need special treatment because AWS DMS can't determine the LOB size per row and column in advance, and therefore can't allocate appropriate memory on the replication instance. AWS DMS provides full, limited, and inline LOB modes to migrate LOB data. Keeping tables with LOBs in a separate task can help you manage the migration activity efficiently.
- Volume of changes – If your workload involves a high volume of CDC changes, such as repeatedly updating the same set of records or inserting, updating, and deleting the same records, you can improve the target apply throughput by using batch apply.
Solution overview
The goal of this post is to analyze the source database’s data dictionary and combine that information with hardware details to create recommendations for efficient data migration tasks. This analysis helps determine the optimal number of AWS DMS tasks and grouping of tables within those tasks, reducing potential latency issues during the migration process.
The workflow involves the following steps:
1. Create a control table on the source database.
2. Populate the control table by analyzing table size, partitions, indexes, constraints, data types, and LOB data using data dictionary tables and views.
3. Capture the daily growth of tables by monitoring the volume of incoming changes.
4. Run the grouping code to group the tables into AWS DMS tasks.
The following diagram illustrates the solution architecture.
Prerequisites
To follow along with this post, you should have familiarity with the following:
- AWS DMS
- Amazon Relational Database Service (Amazon RDS) for Oracle
- SQL and PL/SQL procedures
Create a control table
In this step, we create a control table named table_mapping, which provides a single consolidated view of the data we're migrating. This table is built from data dictionary information on table size, partitioning and partition sizes (count, average, minimum, maximum), LOB columns, number of indexes, primary key or unique key constraints, foreign key constraints, and the count of truncate, insert, update, and delete operations on each table.
This control table provides the baseline data used in the next step to group tables.
- Connect to your source Oracle database.
- Run the following SQL block to create the control table:
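The following is a minimal sketch of such a control table. The column names and data types are illustrative (not the exact DDL used in this post) and should be adapted to your environment:

```sql
-- Minimal sketch of the control table; column names and types are illustrative
CREATE TABLE table_mapping (
  owner             VARCHAR2(128),
  table_name        VARCHAR2(128),
  size_in_mb        NUMBER,
  partitioned       VARCHAR2(3),      -- YES/NO
  partition_count   NUMBER,
  avg_part_size_mb  NUMBER,
  min_part_size_mb  NUMBER,
  max_part_size_mb  NUMBER,
  num_indexes       NUMBER,
  pk_present        VARCHAR2(1),      -- Y if a primary key exists
  uk_present        VARCHAR2(1),      -- Y if a unique key exists
  fk_present        VARCHAR2(1),      -- Y if foreign key constraints exist
  lob_column        VARCHAR2(1),      -- Y if the table has at least one LOB column
  truncate_count    NUMBER,
  insert_count      NUMBER,
  update_count      NUMBER,
  delete_count      NUMBER,
  avg_daily_dml     NUMBER,
  step              NUMBER,
  sub_step          NUMBER
);
```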
Populate the control table
In this step, you populate the control table using data dictionary objects in the Oracle database that provide information about the size, type, partitioning, constraints, and LOB data related to database objects. Specifically, the following data dictionary objects are used:
- DBA_SEGMENTS – Provides information about the size and type of database objects, such as tables, indexes, and clusters
- DBA_CONSTRAINTS – Contains details about the different types of constraints defined on database objects, such as primary keys, foreign keys, unique constraints, and check constraints
- DBA_LOBS – Provides information about LOB data types, such as Binary Large Objects (BLOBs) and Character Large Objects (CLOBs), which are used to store large amounts of unstructured data
These data dictionary objects can be queried to retrieve metadata about the database objects, allowing administrators and developers to identify and analyze the structure, size, constraints, and modifications made to various objects in the database.
We read from the data dictionary objects into the control table.
- Insert into the control table the database table details, such as name, size, partition, and so on, for the desired schema (here we are using 'ADMIN').
- Update the PK_PRESENT field of the control table for database tables that have primary keys defined.
- Update the UK_PRESENT field of the control table for database tables that have unique keys defined.
- Update the LOB_COLUMN field of the control table for database tables that have at least one LOB column.
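The following is a condensed sketch of these insert and update statements, based on the illustrative table_mapping columns shown earlier; adjust the schema name and column list to match your environment:

```sql
-- Sketch: load table sizes for the ADMIN schema from DBA_SEGMENTS
-- (additional columns such as partition and index counts would be populated similarly)
INSERT INTO table_mapping (owner, table_name, size_in_mb, partitioned)
SELECT s.owner,
       s.segment_name,
       ROUND(SUM(s.bytes) / 1024 / 1024) AS size_in_mb,
       MAX(CASE WHEN s.segment_type LIKE 'TABLE%PARTITION' THEN 'YES' ELSE 'NO' END)
FROM   dba_segments s
WHERE  s.owner = 'ADMIN'
AND    s.segment_type IN ('TABLE', 'TABLE PARTITION', 'TABLE SUBPARTITION')
GROUP  BY s.owner, s.segment_name;

-- Sketch: flag tables that have a primary key
UPDATE table_mapping t
SET    t.pk_present = 'Y'
WHERE  EXISTS (SELECT 1 FROM dba_constraints c
               WHERE  c.owner = t.owner
               AND    c.table_name = t.table_name
               AND    c.constraint_type = 'P');

-- Sketch: flag tables that have a unique key
UPDATE table_mapping t
SET    t.uk_present = 'Y'
WHERE  EXISTS (SELECT 1 FROM dba_constraints c
               WHERE  c.owner = t.owner
               AND    c.table_name = t.table_name
               AND    c.constraint_type = 'U');

-- Sketch: flag tables that contain at least one LOB column
UPDATE table_mapping t
SET    t.lob_column = 'Y'
WHERE  EXISTS (SELECT 1 FROM dba_lobs l
               WHERE  l.owner = t.owner
               AND    l.table_name = t.table_name);

COMMIT;
```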
Capture the volume of DML changes over time
The DBA_TAB_MODIFICATIONS view in the Oracle database contains information about modifications made to all tables in the database since the last time statistics were gathered on those tables. This includes both partition and non-partition operations on the tables.
By reviewing the data from DBA_TAB_MODIFICATIONS on a daily basis, you can gain insights into the daily modifications made to tables. This information can be valuable for identifying tables that have undergone a significant number of changes, which may impact their performance or require maintenance tasks like statistics gathering or reorganization.
- Create a staging table called MST_DBA_TAB_MOD to get the details from DBA_TAB_MODIFICATIONS.
- Populate the MST_DBA_TAB_MOD table with the daily average DML count for the tables by gathering the information from the Oracle metadata views dba_tab_modifications and dba_tables. The counts are noted from the last analyze date and then averaged, so having recent table statistics makes the result more accurate.
- Next, we populate the DML information in the table_mapping control table by running an update statement that collects this information from MST_DBA_TAB_MOD, which we populated in the previous step.
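The following sketch illustrates these steps. The MST_DBA_TAB_MOD structure and the averaging logic are illustrative:

```sql
-- Sketch: staging table for daily DML volumes
CREATE TABLE mst_dba_tab_mod (
  table_owner    VARCHAR2(128),
  table_name     VARCHAR2(128),
  inserts        NUMBER,
  updates        NUMBER,
  deletes        NUMBER,
  avg_daily_dml  NUMBER,
  capture_date   DATE DEFAULT SYSDATE
);

-- Sketch: average the DML counts since the last analyze date
-- (optionally run DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO first so recent DML is visible)
INSERT INTO mst_dba_tab_mod (table_owner, table_name, inserts, updates, deletes, avg_daily_dml)
SELECT m.table_owner,
       m.table_name,
       m.inserts,
       m.updates,
       m.deletes,
       ROUND((m.inserts + m.updates + m.deletes) /
             GREATEST(TRUNC(SYSDATE) - TRUNC(t.last_analyzed), 1))
FROM   dba_tab_modifications m
JOIN   dba_tables t
       ON t.owner = m.table_owner
      AND t.table_name = m.table_name
WHERE  m.table_owner = 'ADMIN'
AND    m.partition_name IS NULL;      -- table-level rows only

-- Sketch: carry the averaged DML volume into the control table
UPDATE table_mapping tm
SET    tm.avg_daily_dml = (SELECT MAX(s.avg_daily_dml)
                           FROM   mst_dba_tab_mod s
                           WHERE  s.table_owner = tm.owner
                           AND    s.table_name  = tm.table_name);
COMMIT;
```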
Categorize the tables by step number
Tables are then categorized in the database based on various factors related to their data characteristics, as shown in the following code. These factors include whether the table is partitioned or non-partitioned, whether it contains LOBs or not, the table size, and the number of DML operations performed on the table. Non-partitioned tables that don't have LOB fields are categorized as step 1, non-partitioned tables with LOB fields are categorized as step 2, partitioned tables without LOB fields are categorized as step 3, and so on. For example, we have classified all tables reporting SIZE_IN_MB as 0 as out of scope for the migration. You can add additional tables to this step if your use case needs it.
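A sketch of this categorization logic, with illustrative size and DML thresholds, might look like the following:

```sql
-- Sketch: assign step numbers based on partitioning, LOBs, size, and DML volume
UPDATE table_mapping SET step = 0 WHERE size_in_mb = 0;                                        -- out of scope
UPDATE table_mapping SET step = 1 WHERE step IS NULL AND partitioned = 'NO'  AND lob_column IS NULL;
UPDATE table_mapping SET step = 2 WHERE step IS NULL AND partitioned = 'NO'  AND lob_column = 'Y';
UPDATE table_mapping SET step = 3 WHERE step IS NULL AND partitioned = 'YES' AND lob_column IS NULL;
UPDATE table_mapping SET step = 4 WHERE step IS NULL AND partitioned = 'YES' AND lob_column = 'Y';
UPDATE table_mapping SET step = 5 WHERE NVL(step, 0) > 0 AND avg_daily_dml > 1000000;          -- high DML tables (threshold is illustrative)
COMMIT;
```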
The categorization of tables based on these attributes is done for optimization purposes.
Group tables
The process of recommending groups begins by creating and populating the TABLE_MAPPING_GROUPS table based on the information stored in the TABLE_MAPPING control table. The process is initiated by a procedure that takes three parameters:
- Mapping table (TABLE_MAPPING)
- Source migration schema (ADMIN)
- Size of the DB objects per task (600 GB)
The size of the objects per task (600 GB) is chosen to evenly distribute the tables across tasks, considering factors such as CPU, network, and I/O capabilities of the source and target replication instances.
We use a shell script to call the stored procedure in cap_parfile_grouping_gen.sql to group the tables, and then list the grouped tables by step and groupnum using another SQL file, temp_sql_gen.sql.
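At the SQL level, a sketch of what the script runs might look like the following; the procedure name (assumed here to match the file name) and its parameter order are illustrative:

```sql
-- Sketch: compile the grouping procedure, run it for the ADMIN schema with a
-- target group size of 600 GB, then list the resulting groups
@cap_parfile_grouping_gen.sql

EXEC cap_parfile_grouping_gen('TABLE_MAPPING', 'ADMIN', 600)

@temp_sql_gen.sql
```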
Next, the procedure contained in cap_parfile_grouping_gen.sql has two separate sections, one for partitioned tables and another for non-partitioned tables, because these are grouped separately based on the information provided. The TABLE_MAPPING_GROUPS table is created and populated with the grouped tables from the TABLE_MAPPING table. This grouping process helps evenly distribute the data across tasks, facilitating efficient replication or data movement operations.
The final temp_sql_gen.sql comprises a select statement that iteratively goes over table_mapping_groups and lists the table names by the respective group names, which can be used to create the respective AWS DMS tasks.
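A sketch of such a statement, based on the illustrative column names used earlier, could be:

```sql
-- Sketch: list table names per group, ready to use in AWS DMS task table mappings
SELECT groupnum,
       COUNT(*)                      AS table_count,
       ROUND(SUM(size_in_mb) / 1024) AS group_size_gb,
       LISTAGG(table_name, ',') WITHIN GROUP (ORDER BY table_name) AS table_list
FROM   table_mapping_groups
GROUP  BY groupnum
ORDER  BY groupnum;
```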
Demonstration
For demonstration purposes, we use an Oracle source instance with a size of 4.4 TB and group tables into tasks of approximately 600 GB each. The instance type used is db.m5.4xlarge, with 8 cores, 16 vCPUs, and 64 GiB of memory. After evaluating CPU utilization, network bandwidth, and workload, it was determined that grouping tasks in sets of 600 GB would be ideal for this workload. However, it's recommended to carefully review the source database size, workload, CPU, memory, and network utilization to identify the appropriate group size, because each source is different.
The algorithm loops over the control table, sorting the tables within each step by size in descending order, and then groups the sorted tables based on the specified size input parameter.
We have combined the following steps into a shell script so you don't have to manually run the individual steps:
- Create a control table
- Populate the control table
- Capture the volume of DML changes
- Categorize the tables by step number
The following code is the associated shell script:
The following code is the associated table_mapping SQL file:
The following code is a snippet of the control table table_mapping, populated from the data dictionary after running the shell script mentioned in the previous step:
The following is a screenshot of the code snippet.
After the table grouping script is run, the final output consists of six initial step numbers (0 – 5):
- 0 for ignore or skip for special handling tables
- 1 for non-partitioned tables
- 2 for non-partitioned tables with LOBs
- 3 for partitioned tables
- 4 for partitioned tables with LOBs
- 5 for high DML tables
Next, we run the table grouping shell script to create the final grouping of the tables. Tables that are part of a group will be included under a single AWS DMS task. We already went over the code details in prior sections.
The following code illustrates the shell script details:
The PL/SQL procedure is used to generate a final recommendation for migrating data tables. It involves running a PL/SQL block with the following steps:
- Check and drop an existing table named table_mapping_groups to create a fresh one
- Create a new table_mapping_groups table from the TABLE_MAPPING control table
- Group partitioned tables with LOB and non-LOB columns based on a group size of 600 GB, resulting in groups like 10, 11, 12, and 13
- Group non-partitioned tables with LOB and non-LOB columns using a similar procedure
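A condensed sketch of that PL/SQL logic is shown below. It uses a running size total to start a new sub_step whenever the group size is exceeded; for brevity, the loop covers only one step, and the table structure and group size are illustrative:

```sql
-- Sketch: drop the groups table if it already exists
BEGIN
  EXECUTE IMMEDIATE 'DROP TABLE table_mapping_groups';
EXCEPTION
  WHEN OTHERS THEN NULL;   -- table didn't exist yet
END;
/

-- Sketch: rebuild the groups table from the control table
CREATE TABLE table_mapping_groups AS
SELECT owner, table_name, size_in_mb, partitioned, lob_column, step,
       0 AS sub_step, 0 AS groupnum
FROM   table_mapping
WHERE  step > 0;

-- Sketch: assign sub_steps within one step using a running size total
-- (repeat per step for partitioned/non-partitioned, LOB/non-LOB tables)
DECLARE
  v_groupsize  NUMBER := 600 * 1024;   -- 600 GB expressed in MB
  v_running_mb NUMBER := 0;
  v_sub_step   NUMBER := 0;
BEGIN
  FOR rec IN (SELECT owner, table_name, size_in_mb
              FROM   table_mapping_groups
              WHERE  step = 1
              ORDER  BY size_in_mb DESC) LOOP
    -- Start a new sub_step when adding this table would exceed the group size
    IF v_running_mb + rec.size_in_mb > v_groupsize AND v_running_mb > 0 THEN
      v_sub_step   := v_sub_step + 1;
      v_running_mb := 0;
    END IF;
    v_running_mb := v_running_mb + rec.size_in_mb;
    UPDATE table_mapping_groups
    SET    sub_step = v_sub_step,
           groupnum = TO_NUMBER('1' || TO_CHAR(v_sub_step))   -- step 1 + sub_step 0 => group 10
    WHERE  owner = rec.owner
    AND    table_name = rec.table_name;
  END LOOP;
  COMMIT;
END;
/
```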
The process accommodates tables requiring special handling by separating them into distinct groups or excluding them from the migration based on specific requirements. The migration process described here is unique and may vary based on individual case requirements. In the following example, tables are grouped based on ${groupsize} of 600 GB:
Group numbers are generated by concatenating two columns, step and sub_step. For example, step 1 and sub_step 0 combine to become group number 10, with 34 tables and a size of 547 GB.
Similarly, the following groups are also created:
In the preceding output, we have grouped tables as close to the desired group size of 600 GB as possible. For example, group 10 contains 34 tables with a total size of 547 GB. A single huge table with a size of 1.5 TB has been kept in a separate task as group 13 because we don't want to split a single table across tasks. Depending on the number of tables and table sizes, some groups will be either smaller or larger than the desired group size of 600 GB. Therefore, a database of 4.4 TB is migrated with 10 tasks by grouping tables based on partitioning, LOB presence, and DML volume.
After the script runs, the following files are generated with the table names:
Table names that are part of the same file, as shown in the following code, should be grouped together in a single task:
Clean up
As part of this demonstration, we created a few tables on the source Oracle database. You can clean up these tables using drop statements after the task grouping has been noted.
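Assuming the helper tables created earlier in this post, a cleanup sketch looks like the following:

```sql
-- Remove the helper tables created for this analysis
DROP TABLE table_mapping_groups;
DROP TABLE mst_dba_tab_mod;
DROP TABLE table_mapping;
```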
Conclusion
In this post, we discussed how you can use existing data dictionary information to analyze database size, workload, and hardware specifications. This analysis helps determine the optimal number of tasks and table groupings for efficient database migration using AWS DMS. By combining database object details with source database hardware information, you can make educated decisions during the migration design phase. Additionally, we recommend referring to Best practices for AWS Database Migration Service to fine-tune your migration tasks.
We invite you to share your feedback, comments, and questions in the comments section.
About the Authors
Manojit Saha Sardar is a Senior Database Engineer with Amazon Web Services and a Subject Matter Expert for AWS DMS, Amazon RDS, and Amazon RDS for PostgreSQL. He has over a decade of experience working with relational databases. At AWS, he works with customers to handle a variety of migration scenarios and assists them in troubleshooting issues related to Amazon RDS for Oracle as well as Amazon RDS for PostgreSQL.
Chirantan Pandya is a Database Engineer with Amazon Web Services and a Subject Matter Expert for AWS DMS. He is based out of Dallas, Texas, specializes in database migration, and has over 15 years of experience. At AWS, he collaborates closely with our customers to provide guidance and technical assistance on database migration projects.