AWS Database Blog

Best practices for exporting and importing data from Amazon Aurora MySQL to Amazon S3

You can build highly distributed applications using a multitude of purpose-built databases by decoupling complex applications into smaller pieces, which allows you to choose the right database for the right job. Amazon Aurora is the preferred choice for OLTP workloads. Aurora makes it easy to set up, operate, and scale a relational database in the cloud.

This post demonstrates how you can export data from Aurora MySQL to Amazon Simple Storage Service (Amazon S3) and import it back, and shares associated best practices. The ability to export data to and import data from Amazon S3 is also available for Amazon Aurora PostgreSQL, but this post focuses on Amazon Aurora MySQL.

Overview of Aurora and Amazon S3

Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud, which combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data.

Prerequisites

Before you get started, complete the following:

  1. Launch an Aurora MySQL DB cluster. You can also use an existing cluster.
  2. Launch an Amazon EC2 instance with the MySQL client installed. You can also use MySQL Workbench for this purpose.
  3. Create the following required Identity and Access Management (IAM) policies and roles:
    • Create an IAM policy that grants least-privilege access to the resources in the following code and name it aurora-s3-access-pol. The policy must have access to the S3 bucket where the files are stored (for this post, sample-loaddata01).
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:AbortMultipartUpload",
                      "s3:DeleteObject",
                      "s3:ListMultipartUploadParts",
                      "s3:PutObject",
                      "s3:ListBucket"
                  ],
                  "Resource": [
                      "arn:aws:s3:::sample-loaddata01/*",
                      "arn:aws:s3:::sample-loaddata01"
                  ]
              }
          ]
      }
    • Create an IAM role and modify the trust relationship according to the following code. AssumeRole allows Aurora to access other AWS services on your behalf.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "rds.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    • Attach the policy aurora-s3-access-pol from the previous step. For this post, the role is named aurora-s3-access-role and the role ARN is arn:aws:iam::123456789012:role/aurora-s3-access-role. For one way to script these IAM steps with the AWS CLI, see the sketch after this list.
  4. Create and assign parameter groups for the cluster:
    • Create a custom parameter group using the AWS Command Line Interface (AWS CLI). See the following code:
      aws rds create-db-cluster-parameter-group \
      --db-cluster-parameter-group-name aurora-cluster-para-grp \
      --db-parameter-group-family aurora5.6 \
      --description "cluster parameter group for s3 copy" 
    • Modify the parameter group to set the value of the aws_default_s3_role parameter to the ARN of the IAM role aurora-s3-access-role. See the following code:
      aws rds modify-db-cluster-parameter-group \
        --db-cluster-parameter-group-name aurora-cluster-para-grp \
        --parameters "ParameterName=aurora_default_s3_role,ParameterValue=arn:aws:iam::123456789012:role/aurora-s3-access-role,ApplyMethod=immediate"
      

      You can map the IAM role to the aurora_select_into_s3_role parameter to allow only data export or map the aurora_load_from_s3_role parameter to allow only data import from the S3 bucket. If the IAM role isn’t specified for these two parameters, Aurora uses the IAM role specified in aws_default_s3_role for both export and import.

  5. Modify the Aurora cluster and associate the newly created parameter group. See the following code:
    aws rds modify-db-cluster \
      --db-cluster-identifier database-1 \
      --db-cluster-parameter-group-name aurora-cluster-para-grp \
      --apply-immediately

    The --apply-immediately flag triggers the modification and asynchronously applies changes as soon as possible, regardless of the PreferredMaintenanceWindow setting on the DB cluster.

  6. Associate the IAM role to the cluster. This allows database users in an Aurora DB cluster to access the S3 bucket. See the following code:
    aws rds add-role-to-db-cluster \
        --db-cluster-identifier database-1 \
        --role-arn arn:aws:iam::123456789012:role/aurora-s3-access-role
  7. Reboot the primary instance in the cluster to apply the changes. See the following code:
    aws rds reboot-db-instance \
          --db-instance-identifier database-1-instance-1

    You’re now ready to explore the following use cases of exporting and importing data.
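
If you prefer to script the IAM setup described in step 3, the following AWS CLI sketch is one way to do it. The local file names aurora-s3-access-pol.json and trust-policy.json (holding the two JSON documents shown earlier) are assumptions for illustration; the last command simply confirms that the role and the custom parameter group are associated with the cluster.

# create the access policy and the role, then attach the policy to the role
aws iam create-policy \
  --policy-name aurora-s3-access-pol \
  --policy-document file://aurora-s3-access-pol.json

aws iam create-role \
  --role-name aurora-s3-access-role \
  --assume-role-policy-document file://trust-policy.json

aws iam attach-role-policy \
  --role-name aurora-s3-access-role \
  --policy-arn arn:aws:iam::123456789012:policy/aurora-s3-access-pol

# confirm the role association and parameter group on the cluster
aws rds describe-db-clusters \
  --db-cluster-identifier database-1 \
  --query "DBClusters[0].[AssociatedRoles,DBClusterParameterGroup]"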

Exporting data from Aurora MySQL to Amazon S3

To export your data, complete the following steps:

  1. Connect to the cluster as the master user. By default, the master user has permission to export and import data from Amazon S3. For this post, you create a test user with the least-required permission to export data to the S3 bucket. See the following code:
    $ mysql -h database-1.cluster-xxxxxxxxxxxx.us-west-2.rds.amazonaws.com -D sampledb01 -P 3306 -u admin01 -p
  2. Create the user testuser and grant the required privileges to the user on the database. This post uses sampledb01. See the following code:
    MySQL> create user 'testuser'@'%' identified by 'Password';
    MySQL> grant select on sampledb01.* to 'testuser'@'%';
  3. Grant the SELECT INTO S3 privilege to testuser with the following code:
    MySQL> grant SELECT INTO S3 on *.* to 'testuser'@'%';
  4. Log in to the cluster as testuser:
    $ mysql -h database-1.cluster-xxxxxxxxxxxx.us-west-2.rds.amazonaws.com -D sampledb01 -P 3306 -u testuser -p
  5. For this post, you export the lineitem table to the S3 bucket. You can use any existing table in your database (preferably a small one) to test this feature. See the following code:
    MySQL> describe lineitem;
    +-----------------+-------------+------+-----+---------+-------+
    | Field           | Type        | Null | Key | Default | Extra |
    +-----------------+-------------+------+-----+---------+-------+
    | orderkey        | int(11)     | NO   |     | NULL    |       |
    | linenumber      | int(11)     | NO   |     | NULL    |       |
    | custkey         | int(11)     | NO   |     | NULL    |       |
    | partkey         | int(11)     | NO   |     | NULL    |       |
    | suppkey         | int(11)     | NO   |     | NULL    |       |
    | orderdate       | int(11)     | NO   |     | NULL    |       |
    | orderpriority   | varchar(15) | NO   |     | NULL    |       |
    | shippriority    | varchar(1)  | NO   |     | NULL    |       |
    | quantity        | int(11)     | NO   |     | NULL    |       |
    | extendedprice   | int(11)     | NO   |     | NULL    |       |
    | ordertotalprice | int(11)     | NO   |     | NULL    |       |
    | discount        | int(11)     | NO   |     | NULL    |       |
    | revenue         | int(11)     | NO   |     | NULL    |       |
    | supplycost      | int(11)     | NO   |     | NULL    |       |
    | tax             | int(11)     | NO   |     | NULL    |       |
    | commitdate      | int(11)     | NO   |     | NULL    |       |
    | shipmode        | varchar(10) | NO   |     | NULL    |       |
    +-----------------+-------------+------+-----+---------+-------+ 

    The following screenshot displays a few records from the lineitem table.

  6. To export data from the Aurora table to the S3 bucket, use the SELECT INTO OUTFILE S3 statement. The following statement exports the entire table to the S3 bucket. If you’re trying this feature out for the first time, consider adding a LIMIT clause when working with a large table. See the following code:
    MySQL> SELECT * FROM lineitem INTO OUTFILE S3 
    's3://sample-loaddata01/unload-data/file'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';
    Query OK, 50000000 rows affected (3 min 37.05 sec)

    The following code shows the exported file in the S3 bucket. The default file size threshold is 6 GB. Because the data selected by the statement is less than the file size threshold, a single file is created.

    $ aws s3 ls s3://sample-loaddata01/unload-data --recursive --human-readable --summarize
    2020-04-29 00:10:17    4.7 GiB unload-data/file.part_00000

    If you need files smaller than 6 GB, you can identify a column to split the table data into smaller portions and run multiple SELECT INTO OUTFILE S3 statements with WHERE conditions. It’s best to do this when the amount of data selected is more than 25 GB.
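
    For example, the following is a minimal sketch that splits the export on the orderkey column; the split point and the part1/part2 prefixes are illustrative and assume orderkey values are roughly evenly distributed:

    MySQL> SELECT * FROM lineitem WHERE orderkey <= 25000000
    INTO OUTFILE S3 's3://sample-loaddata01/unload-data/part1'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

    MySQL> SELECT * FROM lineitem WHERE orderkey > 25000000
    INTO OUTFILE S3 's3://sample-loaddata01/unload-data/part2'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';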

Importing data from Amazon S3 to Aurora MySQL

In this section, you load the data back to the Aurora table from the Amazon S3 file. Complete the following steps:

  1. Grant the LOAD FROM S3 privilege to testuser. See the following code:
    MySQL> grant LOAD FROM S3 on *.* to 'testuser'@'%';
  2. Use LOAD DATA FROM S3 to import the data file from an Amazon S3 prefix. See the following code:
    MySQL > LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload-data/file'
    INTO TABLE lineitem 
    FIELDS TERMINATED BY ','   
    LINES TERMINATED BY '\n'
    (orderkey,linenumber,custkey,partkey,suppkey,orderdate,orderpriority,shippriority,quantity,extendedprice,ordertotalprice,discount,revenue,supplycost,tax,commitdate,shipmode);
    Query OK, 50000000 rows affected ( 8 min 16 sec)
  3. Verify the loaded files using the aurora_s3_load_history system table. See the following code:
    MySQL> select * from mysql.aurora_s3_load_history order by load_timestamp desc limit 1\G
    
    *************************** 1. row ***************************
      load_prefix: s3://sample-loaddata01/unload-data/file
         file_name: unload-data/file.part_00000
    version_number: 
      bytes_loaded: 5052058881
    load_timestamp: 2020-05-05 18:09:53
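
    As a quick sanity check, you can also compare the table’s row count with the rows affected count reported by the load statement:

    MySQL> select count(*) from lineitem;
    -- expect 50000000, matching the rows affected reported by LOAD DATA above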

Best practices

This section discusses a few best practices for bulk loading large datasets from Amazon S3 to your Aurora MySQL database. This post bases these observations on a series of tests loading 50 million records to the lineitem table on a db.r4.4xlarge instance (see the preceding sections for table structure and example records).

Load testing was carried out when no other active transactions were running on the cluster. Your results might vary depending on your cluster load and instance type.

File size

Make sure that the source files in the S3 bucket aren’t too small. Loading many small files (1–10 MB) adds per-file overhead and hurts load performance. For optimal performance, consider file sizes between 100 MB and 1 GB.

The following graph shows the import time versus file size while loading 50 million records to the lineitem table on a db.r4.4xlarge instance. Load performance degrades sharply with small files but doesn’t meaningfully improve with very large files.
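
If your source data starts out as one large local file, one way to produce files in this range before uploading is to split it without breaking lines. The following is a minimal sketch using GNU coreutils and the AWS CLI; the local file name lineitem.csv and the load-data prefix are illustrative.

# split into files of at most ~500 MB without splitting a line across files
split -C 500M --numeric-suffixes lineitem.csv lineitem_part_

# upload only the generated pieces to the S3 prefix used for the load
aws s3 cp . s3://sample-loaddata01/load-data/ --recursive --exclude "*" --include "lineitem_part_*"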

Partitioning

If you’re using partitioned tables, consider loading partitions in parallel to improve load performance. Load from Amazon S3 supports explicit partition selection and only locks the partition where data is being loaded. Partitioning the lineitem table on the linenumber column and loading all partitions concurrently shows significant improvement in load time. See the following code:

...
PARTITION BY List(linenumber) (
    PARTITION p1 VALUES IN (1),
    PARTITION p2 VALUES IN (2),
    PARTITION p3 VALUES IN (3),
    PARTITION p4 VALUES IN (4),
    PARTITION p5 VALUES IN (5),
    PARTITION p6 VALUES IN (6),
    PARTITION p7 VALUES IN (7)
);

Without a partition strategy, the approximate load time is 8 minutes, 20 seconds. With the list (linenumber) partition strategy and concurrent load, the approximate load time is 2 minutes, 30 seconds.
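
The concurrent loads can be driven from separate sessions using explicit partition selection. The following is a minimal sketch for two of the seven partitions; it assumes the source data was exported per linenumber value into separate prefixes (the linenumber1 and linenumber2 prefix names are illustrative), and the column list is omitted for brevity (see the earlier LOAD DATA statement).

-- session 1
MySQL> LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload-data/linenumber1'
INTO TABLE lineitem PARTITION (p1)
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- session 2
MySQL> LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload-data/linenumber2'
INTO TABLE lineitem PARTITION (p2)
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';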

Concurrent loads

To improve load performance, you can load multiple tables concurrently. However, the degree of concurrency can impact other transactions running on the cluster. The number of vCPUs allocated to the instance type plays a key role, because the load operation requires CPU cycles to read data, insert into tables, commit changes, and more. For more information, see Amazon RDS Instance Types. Loading several tables concurrently on an instance with few vCPUs can cause CPU utilization to spike and may impact the existing workload. For more information, see Overview of Monitoring Amazon RDS.

The following Amazon CloudWatch metric illustrates the optimal CPU utilization when eight concurrent loads were running on a db.r4.8xlarge (32 vCPU) cluster.

The following Amazon RDS Performance Insights graph was taken during the eight concurrent loads. The graph shows that the majority of the time is spent on basic host resources such as CPU and I/O. The dominant contributor to DB load is the wait event io/table/sql/handler, which suggests the storage engine is processing I/O requests against tables.

For more information about wait events, see Aurora MySQL Events. For more information about using Performance Insights, see Tuning Amazon RDS for MySQL with Performance Insights.

Another option is to split the source data into multiple sets of files in Amazon S3 and run concurrent loads into a single non-partitioned table. This slightly improves the load time compared to running a single load operation. With one session, the approximate load time to import 150 million rows was 23 minutes, 40 seconds. With three concurrent sessions, each importing 50 million records, the approximate load time was 22 minutes, 10 seconds.
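
As an illustration, the following shell sketch drives three concurrent load sessions from the MySQL client. The chunk1, chunk2, and chunk3 prefixes are illustrative, and it assumes the MYSQL_PWD environment variable is exported so the background sessions don’t prompt for a password.

# start three LOAD DATA FROM S3 sessions in parallel, one per S3 prefix
for i in 1 2 3; do
  mysql -h database-1.cluster-xxxxxxxxxxxx.us-west-2.rds.amazonaws.com -u testuser sampledb01 -e \
    "LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload-data/chunk${i}'
     INTO TABLE lineitem
     FIELDS TERMINATED BY ','
     LINES TERMINATED BY '\n';" &
done
wait  # block until all three background loads complete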

Running the export from a reader instance

Exporting a large amount of data from Aurora to Amazon S3 is an intensive operation. Running the export on the writer instance may impact other transactions running on the cluster. To avoid this, you can run the export from a reader instance. Aurora readers are designed to serve read traffic and support the SELECT INTO OUTFILE S3 operation.
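
A minimal sketch: connect to the cluster’s reader endpoint (note the cluster-ro- portion of the DNS name; the endpoint below follows the same placeholder pattern used earlier) and run the same export statement shown previously.

$ mysql -h database-1.cluster-ro-xxxxxxxxxxxx.us-west-2.rds.amazonaws.com -D sampledb01 -P 3306 -u testuser -p
MySQL> SELECT * FROM lineitem INTO OUTFILE S3 's3://sample-loaddata01/unload-data/file'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';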

Constraints

The data import process often becomes slower when foreign key and unique key checks are enabled. You can turn these checks off at the session level by setting the corresponding variables to 0 before running the load statements. Only consider this if you trust the integrity of the data loaded from Amazon S3 and no other write workload is running against the table. To disable the key checks, enter the following code:

SET foreign_key_checks=0;
SET unique_checks=0;

After the load statements are complete, enable the key checks with the following code:

SET foreign_key_checks=1;
SET unique_checks=1;

Don’t modify the table structure (for example, by using ALTER TABLE) while foreign key checks and unique checks are disabled. As a general best practice, assess the impact before disabling these parameters.

Summary

This post demonstrated how to import and export data between Aurora MySQL and Amazon S3. Aurora provides operational analytics capabilities, and its integration with Amazon S3 makes it easier to establish an agile analytical environment. For more information about importing data, see Loading Data into an Amazon Aurora MySQL DB Cluster from Text Files in an Amazon S3 Bucket. For more information about exporting data, see Saving Data from an Amazon Aurora MySQL DB Cluster into Text Files in an Amazon S3 Bucket. For more information about Aurora best practices, see Best Practices with Amazon Aurora MySQL.

About the Authors


Suresh Patnam is a Senior Big Data Architect at AWS. He works with customers in their journey to the cloud with a focus on big data, data lakes, and data strategy. In his spare time, Suresh enjoys playing tennis and spending time with his family.


Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.