AWS Database Blog

Turn petabytes of relational database records into a cost-efficient audit trail using Amazon Athena, AWS DMS, Amazon RDS, and Amazon S3

Many AWS manufacturing customers are legally required to keep production records for years—sometimes decades. These audit trails are used to prove compliance with process specifications or to track provenance of parts. Having a complete and correct audit trail is critical to manage warranty-related risks. As a result, modern smart factories generate vast amounts of data at each step of the production process. Audit trail use cases often require the storage of tens or hundreds of terabytes for each plant. In many cases, those audit trail records reside in relational databases on premises. In this post, we show you how an automotive manufacturing customer uses AWS services to meet audit trail requirements with strong security assurances at a fraction of the infrastructure cost of relational databases.

At the start of the project, the customer had one petabyte of manufacturing data spread across dozens of Oracle databases in plants around the globe. Because some of the locations were in remote regions with limited network connectivity, the customer decided to use a fleet of AWS Snowball devices to physically ship the data from on premises to an AWS Region.

The database snapshots were then restored to Amazon Relational Database Service (Amazon RDS) instances for data cleaning and validation. AWS then worked with the customer to develop a strategy to store the records durably, accessibly, and cost-efficiently. A proof of concept showed that compressed archives in Amazon Simple Storage Service (Amazon S3), together with the querying capabilities of Amazon Athena, provide a solution that meets all three requirements.

In this post, we show how you can use AWS Database Migration Service (AWS DMS) to migrate relational data from Amazon RDS into compressed archives on Amazon S3. We discuss partitioning strategies for the resulting archive objects and how to use S3 Object Lock to protect them from modification. Lastly, we demonstrate how to query the archive objects with SQL through Athena within seconds, even on large datasets.

Solution overview

The following diagram illustrates the solution architecture. The audit trail data is stored in Amazon S3.

Solution Architecture

The workflow includes the following steps:

  1. To make sure data is immutable, we first enable S3 Object Lock for the archive bucket.
  2. Then we configure AWS DMS to export data from Amazon RDS to a staging bucket.
  3. Next, we copy the data to the archive bucket using S3 Batch Operations.

    An alternative to S3 Batch Operations here is to use AWS Glue if you want more control over the partitioning of the Parquet files. Depending on the expected access frequency, we recommend using S3 Intelligent-Tiering or S3 Glacier Instant Retrieval as the storage class.

  4. We use Amazon Athena to query either storage class within seconds, even on large datasets.

AWS DMS supports various source database engines, including both on-premises and cloud-based instances. For a list of sources supported by AWS DMS, see Sources for AWS DMS.

The goal of the solution is to archive large amounts of relational data for the long term. Simply taking snapshots of the databases and restoring them when needed would pose challenges, because you must do the following:

  • Make sure that snapshot versions stay compatible with database engine versions
  • Restore the full database even if you only want to read a single record

As a result, access times range from minutes to hours depending on the size of the database, and neither the snapshot storage costs nor the database access costs are optimal.

Instead, the solution our customer implemented transforms the tables of the relational database into compressed Parquet files that are stored in Amazon S3. Athena then provides a SQL query interface to those archives that returns results within seconds.

Customer use case

The starting point for our automotive manufacturing customer was a set of relational databases that needed to be retained for several years for compliance reasons. The customer expected to retrieve relatively small subsets of the data infrequently. A proof of concept with Athena demonstrated that the Athena query interface supports all expected query patterns. Those access scenarios also informed the partitioning scheme used to export the large tables into Parquet files that are much smaller than the original tables.

After the target schema had been designed, the team devised an automation that orchestrated AWS DMS tasks for each relevant table in each of the source databases. The actual migration finished within hours by processing tables in parallel.

Benefits

The solution provides three key benefits over an approach that uses snapshots of databases for long-term archival:

  • Athena queries records in Amazon S3 in place within seconds, even for large datasets.
  • Athena incurs no fixed or hourly cost and is billed by the amount of data scanned. This makes it a cost-efficient query interface, particularly if you query only a fraction of your dataset.
  • Amazon S3 provides a robust, built-in governance mechanism that protects data against modification: S3 Object Lock. You can configure S3 Object Lock so that no user, not even a root user with unrestricted permissions, can modify protected data. This helps meet the compliance requirements of many audit trail use cases.

Prerequisites

This solution assumes that you have access to an AWS account, that your source databases are reachable from within your VPC, and that you have read access to all databases that will be archived for the long term.

Prepare S3 buckets

The solution uses a staging bucket as the target for the AWS DMS export and an archive bucket for the eventual storage. Any bucket used as a target for AWS DMS should not use versioning, for performance reasons; we therefore strongly recommend that the staging bucket have versioning disabled. S3 Object Lock, on the other hand, requires versioning, so the archive bucket must have versioning enabled to allow integrity protection through S3 Object Lock.

The solution replicates data that will be protected by S3 Object Lock between the staging bucket (versioning disabled) and the archive bucket (versioning enabled). S3 Object Lock protects the archive data from deletion or modification. With S3 Object Lock, you can store objects using a write-once-read-many (WORM) model.

Object Lock must be activated at bucket creation and is only required for the archive bucket.
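If you script your infrastructure, the following boto3 sketch creates the archive bucket with Object Lock enabled at creation, which also turns on versioning; the bucket name and Region are placeholders. The staging bucket is created the same way, but without Object Lock and without versioning.

import boto3

# Hypothetical bucket name and Region; adjust for your environment.
ARCHIVE_BUCKET = "example-audit-trail-archive"
REGION = "eu-central-1"

s3 = boto3.client("s3", region_name=REGION)

# Enabling Object Lock at creation automatically enables versioning on the bucket.
# (Omit CreateBucketConfiguration when creating the bucket in us-east-1.)
s3.create_bucket(
    Bucket=ARCHIVE_BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
    ObjectLockEnabledForBucket=True,
)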

Enable Object Lock

For audit trails that have high integrity and availability requirements, we recommend using S3 Object Lock in Compliance mode. In Compliance mode, a protected object version can’t be overwritten or deleted by any user, including the root user in your AWS account.

The following screenshot shows an Object Lock configuration in Compliance mode. The retention period is set to 3 days in the development environment.

Object Lock configuration in Compliance Mode
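The same default retention can be configured through the API. The following boto3 sketch applies the 3-day Compliance mode retention shown in the preceding screenshot to the archive bucket; the bucket name is a placeholder.

import boto3

s3 = boto3.client("s3")

# The default retention applies to every new object version written to the bucket.
s3.put_object_lock_configuration(
    Bucket="example-audit-trail-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 3}},
    },
)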

Depending on your access pattern, use either S3 Intelligent-Tiering for the archive bucket if you expect frequent access to objects, or S3 Glacier Instant Retrieval if you expect infrequent access. The latter provides lower storage cost, whereas the former provides lower retrieval cost.
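One way to apply the chosen storage class is a lifecycle rule on the archive bucket. The following sketch transitions objects to S3 Glacier Instant Retrieval shortly after they are written; alternatively, you can set the storage class directly in the S3 Batch Operations copy job described later. The bucket name is a placeholder.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-audit-trail-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-storage-class",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to all objects in the bucket
                "Transitions": [
                    # Use "INTELLIGENT_TIERING" instead if you expect frequent access.
                    {"Days": 0, "StorageClass": "GLACIER_IR"}
                ],
            }
        ]
    },
)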

Use AWS DMS to export RDS databases to Amazon S3

This section describes how to use AWS DMS to export tables from an RDS database into S3 objects in the Parquet file format. We do not describe every step in detail, but instead refer to the AWS DMS documentation.

First, configure a replication instance in the Region that holds the staging bucket, which is the target for AWS DMS. Second, configure a source endpoint for the RDS instance and a target endpoint for the S3 bucket. By default, the replication will work on all tables and views across schemas in a source database. Use table mappings to specify what tables, views, and schemas you want to work with.
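As a minimal example, the following table mapping selects the tables to replicate; the schema and table names are hypothetical, and the resulting JSON string is what you later pass to the AWS DMS task as its table mappings.

import json

# Hypothetical schema and table names; adjust the selection rules to your source database.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-shopfloor-events",
            "object-locator": {"schema-name": "PROD", "table-name": "SHOPFLOOR_EVENTS_%"},
            "rule-action": "include",
        }
    ]
}

# Pass this string as the TableMappings parameter of the replication task or serverless replication.
table_mappings_json = json.dumps(table_mappings)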

For more information about AWS DMS in general, see AWS Database Migration Service Documentation. Refer to Using Amazon S3 as a target for AWS Database Migration Service for instructions on how to set up AWS Identity and Access Management (IAM) roles and permissions for a migration task that reads from Amazon RDS and writes to Amazon S3.

The following screenshot shows the configuration of an S3 target endpoint. The additional endpoint settings ensure that S3 objects use the Parquet format instead of the default CSV format.

DMS Endpoint settings for Parquet export

Use the following configuration for the target S3 endpoint settings to generate Parquet file output with compression:

{
   "CsvRowDelimiter": "\\n", 
   "CsvDelimiter": ",", 
   "CompressionType": "GZIP", 
   "DataFormat": "parquet", 
   "ParquetVersion": "parquet-2-0", 
   "EnableStatistics": true, 
   "DatePartitionEnabled": false 
}
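If you create the endpoint programmatically, the following boto3 sketch mirrors these settings; the endpoint identifier, staging bucket name, and IAM role ARN are placeholders, and the role must allow AWS DMS to write to the staging bucket.

import boto3

dms = boto3.client("dms", region_name="eu-central-1")

# Placeholders: endpoint identifier, role ARN, and bucket name.
response = dms.create_endpoint(
    EndpointIdentifier="rds-archive-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-target-role",
        "BucketName": "example-audit-trail-staging",
        "DataFormat": "parquet",
        "ParquetVersion": "parquet-2-0",
        "CompressionType": "gzip",
        "EnableStatistics": True,
        "DatePartitionEnabled": False,
    },
)
print(response["Endpoint"]["EndpointArn"])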

Deploy a VPC endpoint for Amazon S3 and an interface endpoint for AWS Secrets Manager if you are running the replication instance in a private VPC. Those endpoints are needed for private connectivity between the replication instance and the Amazon S3 API or the AWS Secrets Manager API, respectively. The following screenshot shows an example configuration of a serverless replication.

DMS migration task with a serverless configuration
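The preceding screenshot shows a serverless replication configured in the console. As a scripted alternative, assuming you use a provisioned replication instance, the following boto3 sketch creates and starts a one-time full-load task; all ARNs, identifiers, and table names are placeholders.

import json
import boto3

dms = boto3.client("dms", region_name="eu-central-1")

# Selection rule as sketched earlier; schema and table names are hypothetical.
table_mappings = json.dumps({
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-shopfloor-events",
        "object-locator": {"schema-name": "PROD", "table-name": "SHOPFLOOR_EVENTS_%"},
        "rule-action": "include",
    }]
})

# Placeholder ARNs for the source endpoint, S3 target endpoint, and replication instance.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="rds-archive-full-load",
    SourceEndpointArn="arn:aws:dms:eu-central-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-central-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-central-1:111122223333:rep:INSTANCE",
    MigrationType="full-load",  # one-time export for archival, no ongoing replication
    TableMappings=table_mappings,
)
task_arn = task["ReplicationTask"]["ReplicationTaskArn"]

# Wait until the task is ready, then start the full load.
dms.get_waiter("replication_task_ready").wait(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)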

After completing this step, you have created an AWS DMS task that exports the tables of a source RDS database to the staging S3 bucket using the Parquet file format. In the following step, you copy the objects into a bucket that has S3 Object Lock to prevent modification.

Migrate the objects into an integrity-protected bucket with S3 Batch Operations

A straightforward and cost-efficient way to copy the objects from the non-versioned staging bucket into the archive bucket is S3 Batch Operations. An S3 Batch Operations job allows you to set S3 Object Lock in either Governance mode or Compliance mode. The following screenshot shows the configuration for S3 Object Lock in Compliance mode with a retention period of 2 years.

S3 Object Lock in Compliance mode

We recommend that you define the S3 Batch Operations job based on the S3 inventory report for the staging bucket and use the Copy operation. The replication operation would require the source (the staging bucket) to have versioning enabled. This would not be in line with best practice recommendations for AWS DMS.

S3 Batch Operations copies objects up to 5 GB in size. Take this into consideration when partitioning your tables into Parquet files.
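To script the copy, the following boto3 sketch creates an S3 Batch Operations job that copies every object listed in the staging bucket's S3 inventory manifest into the archive bucket and applies a 2-year Compliance mode retention; the account ID, role ARN, bucket names, and manifest ETag are placeholders.

import boto3
from datetime import datetime, timedelta, timezone

s3control = boto3.client("s3control", region_name="eu-central-1")

# Retain archive objects for roughly 2 years, matching the retention shown earlier.
retain_until = datetime.now(timezone.utc) + timedelta(days=730)

s3control.create_job(
    AccountId="111122223333",          # placeholder account ID
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/s3-batch-copy-role",
    Description="Copy staging objects into the Object Lock protected archive bucket",
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::example-audit-trail-archive",
            "StorageClass": "INTELLIGENT_TIERING",  # pick the class that matches your access pattern
            "ObjectLockMode": "COMPLIANCE",
            "ObjectLockRetainUntilDate": retain_until,
        }
    },
    Manifest={
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": {
            "ObjectArn": "arn:aws:s3:::example-inventory-bucket/staging/manifest.json",
            "ETag": "replace-with-manifest-etag",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::example-batch-report-bucket",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-copy-reports",
        "ReportScope": "FailedTasksOnly",
    },
)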

Alternative: Partition Parquet files with AWS Glue for cost-efficient queries

Athena query performance and cost depend strongly on the amount of data scanned. We recommend organizing the Parquet files so that you achieve a good balance between the size of the files and the number of files scanned per query. In the following paragraphs, we provide one example.

Assume there is a timestamp column in your table that most of your queries condition upon. Further assume that you have partitioned the records of this table by putting all records from one day into a prefix of the form /year=yyyy/month=mm/day=dd/ or /dt=yyyy-mm-dd/. Any query that conditions on the timestamp column would then only process the Parquet files in the relevant prefixes. For more information, see Partitioning data in Athena.

If you know the predominant access pattern for your data, we strongly recommend that you partition your archive objects accordingly. When faced with multiple access patterns, you might have to choose which ones to optimize for, or export your data to Amazon S3 multiple times in different arrangements to accommodate different access patterns. After the data has been exported in one layout, reorganizing it for a different access pattern later is a non-trivial amount of work.

To obtain a partitioning in line with your query patterns, consider the following alternative to the S3 Batch Operations approach: use AWS Glue to read Parquet files from the staging bucket, process the contents, and write partitioned Parquet files to the archive bucket. For an example implementation, see Using the Parquet format in AWS Glue.
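A minimal AWS Glue job for this alternative could look like the following sketch. It assumes a hypothetical timestamp column event_ts and the placeholder bucket names used earlier; it reads the AWS DMS output from the staging bucket and writes day-partitioned, compressed Parquet files to the archive bucket.

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the Parquet files that AWS DMS wrote to the staging bucket.
df = spark.read.parquet("s3://example-audit-trail-staging/database_1/shopfloor_events_1/")

# Derive partition columns from the timestamp column most queries filter on (hypothetical column name).
partitioned = (
    df.withColumn("yyyy", F.date_format("event_ts", "yyyy"))
      .withColumn("mm", F.date_format("event_ts", "MM"))
      .withColumn("dd", F.date_format("event_ts", "dd"))
)

# Write one set of compressed Parquet files per day into the archive bucket.
(partitioned.write
    .mode("append")
    .partitionBy("yyyy", "mm", "dd")
    .parquet("s3://example-audit-trail-archive/database_1/shopfloor_events_1/"))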

Access the archived databases using Athena

Athena requires an AWS Glue Data Catalog that provides the schema for the data in your archive bucket. Refer to Getting started with the AWS Glue Data Catalog for instructions to create a database and corresponding tables. You can either add the tables manually or use an AWS Glue crawler. The Data Catalog allows Athena to query the Parquet files using SQL statements in a serverless manner. You can query Athena directly through the query editor on the Athena console or by using a JDBC or ODBC connection.
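If you choose the crawler route, the following boto3 sketch creates the Data Catalog database and a crawler over the archive prefix, then runs the crawler; the crawler name, role ARN, and S3 path are placeholders, and the database name matches the example below.

import boto3

glue = boto3.client("glue", region_name="eu-central-1")

# Create the Data Catalog database used by Athena.
glue.create_database(DatabaseInput={"Name": "rds-long-term-archive"})

# Crawl the archive prefix to register the partitioned Parquet tables.
glue.create_crawler(
    Name="audit-trail-archive-crawler",         # placeholder name
    Role="arn:aws:iam::111122223333:role/glue-crawler-role",
    DatabaseName="rds-long-term-archive",
    Targets={"S3Targets": [{"Path": "s3://example-audit-trail-archive/database_1/"}]},
)
glue.start_crawler(Name="audit-trail-archive-crawler")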

The following example queries the shopfloor_events_1 table in the database rds-long-term-archive. The table data is partitioned into one Parquet file of approximately 2.5 MB per day. Those Parquet files are stored under prefixes of the form database_1/shopfloor_events_1/yyyy=2022/mm=01/dd=01/. Providing the partition information for yyyy/mm/dd allows Athena to restrict the scan to a single object, or 2.59 MB of scanned data.
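As a sketch, the same query can also be issued through the Athena API. The partition predicates on yyyy, mm, and dd are what allow Athena to prune all but a single day's object; the query results bucket is a placeholder.

import boto3

athena = boto3.client("athena", region_name="eu-central-1")

# The partition columns restrict the scan to a single Parquet object.
query = """
SELECT *
FROM "rds-long-term-archive"."shopfloor_events_1"
WHERE yyyy = '2022' AND mm = '01' AND dd = '01'
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "rds-long-term-archive"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder results bucket
)
print(execution["QueryExecutionId"])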

The following two screenshots show the Amazon Athena query editor with the query results, both with and without partition information.

Athena query performance with partition information

The same query without the additional constraint on yyyy/mm/dd results in scanning 21.48 MB of data and a fourfold increase in runtime.

Athena query performance without partitioning

Because Athena usage is billed based on the amount of data scanned, the first query runs at one eighth of the cost of the second query thanks to the partitioning. It also runs four times faster.

Conclusion

In this post, we showed how you can use Amazon S3 with Amazon Athena for cost-efficient long-term archival of relational databases. This solution has three key advantages over a snapshot-based database archive. First, Athena queries the data in Amazon S3 in place, whereas a snapshot archive requires you to restore the database instance first, which takes minutes or even hours depending on the size of the database. Second, Athena is billed based on the amount of data scanned, whereas a snapshot-based archive requires restoring the full database before you can query any of its data. And third, S3 Object Lock can effectively protect archived data in Amazon S3 against modification or deletion, even by a user with administrative privileges. This is particularly relevant for audit trail use cases that have a legal requirement to keep records for a specified amount of time.

AWS has announced the Simple Database Archival Solution (SDAS). It supports Oracle, MySQL, and Microsoft SQL Server data sources and provides functionality similar to the solution described in this post.

Try SDAS or the approach described in this post and evaluate the cost-savings potential for your relational database archives today.


About the Author

Andreas Bogner is a Global Solutions Architect at AWS based in Munich, Germany. He works with automotive customers on large-scale cloud projects in the autonomous driving, manufacturing, and supply chain domains. As a mathematician by training, he is obsessed with numbers and efficiency.