AWS Big Data Blog
Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena
In Amazon Simple Storage Service (Amazon S3), you can use cross-region replication (CRR) to copy objects automatically and asynchronously across buckets in different AWS Regions. CRR is a bucket-level configuration, and it can help you meet compliance requirements and minimize latency by keeping copies of your data in different Regions. CRR replicates all objects in the source bucket, or optionally a subset, controlled by prefix and tags.
Objects that exist before you enable CRR (pre-existing objects) are not replicated. Similarly, objects might fail to replicate (failed objects) if permissions aren’t in place, either on the IAM role used for replication or the bucket policy (if the buckets are in different AWS accounts).
In our work with customers, we have seen situations where large numbers of objects aren’t replicated for the previously mentioned reasons. In this post, we show you how to trigger cross-region replication for pre-existing and failed objects.
Methodology
At a high level, our strategy is to perform a copy-in-place operation on pre-existing and failed objects. This operation uses the Amazon S3 API to copy the objects over the top of themselves, preserving tags, access control lists (ACLs), metadata, and encryption keys. The operation also resets the Replication_Status flag on the objects. This triggers cross-region replication, which then copies the objects to the destination bucket.
To accomplish this, we use the following:
- Amazon S3 inventory to identify objects to copy in place. These objects don’t have a replication status, or they have a status of FAILED.
- Amazon Athena and AWS Glue to expose the S3 inventory files as a table.
- Amazon EMR to execute an Apache Spark job that queries the AWS Glue table and performs the copy-in-place operation.
Object filtering
To reduce the size of the problem (we’ve seen buckets with billions of objects!) and eliminate S3 List operations, we use Amazon S3 inventory. S3 inventory is enabled at the bucket level, and it provides a report of S3 objects. The inventory files contain the objects’ replication status: PENDING, COMPLETED, FAILED, or REPLICA. Pre-existing objects do not have a replication status in the inventory.
Interactive analysis
To simplify working with the files that are created by S3 inventory, we create a table in the AWS Glue Data Catalog. You can query this table using Amazon Athena and analyze the objects. You can also use this table in the Spark job running on Amazon EMR to identify the objects to copy in place.
Copy-in-place execution
We use a Spark job running on Amazon EMR to perform concurrent copy-in-place operations of the S3 objects. This lets you scale up the number of simultaneous copy operations, which performs far better on a large number of objects than copying them one at a time with a single-threaded application.
Account setup
For the purpose of this example, we created three S3 buckets. The buckets are specific to our demonstration. If you’re following along, you need to create your own buckets (with different names).
We’re using a source bucket named crr-preexisting-demo-source and a destination bucket named crr-preexisting-demo-destination. The source bucket contains the pre-existing objects and the objects with the replication status of FAILED. We store the S3 inventory files in a third bucket named crr-preexisting-demo-inventory.
The following diagram illustrates the basic setup.
You can use any bucket to store the inventory, but the bucket policy must include the following statement (change Resource and aws:SourceAccount to match yours).
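A policy statement along these lines grants Amazon S3 permission to deliver the inventory files (the Sid and account ID are placeholders):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InventoryDestinationBucketPolicy",
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::crr-preexisting-demo-inventory/*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "111111111111",
                    "s3:x-amz-acl": "bucket-owner-full-control"
                }
            }
        }
    ]
}
```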
In our example, we uploaded six objects to crr-preexisting-demo-source. We added three objects (preexisting-*.txt) before CRR was enabled. We also added three objects (failed-*.txt) after permissions were removed from the CRR IAM role, causing CRR to fail.
Enable S3 inventory
You need to enable S3 inventory on the source bucket. You can do this on the Amazon S3 console as follows:
On the Management tab for the source bucket, choose Inventory.
Choose Add new, and complete the settings, choosing the CSV format and selecting the Replication status check box. For detailed instructions for creating an inventory, see How Do I Configure Amazon S3 Inventory? in the Amazon S3 Console User Guide.
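If you prefer to script this step, the AWS CLI equivalent looks roughly like the following (a sketch; the inventory ID and daily schedule are assumptions):

```bash
aws s3api put-bucket-inventory-configuration \
    --bucket crr-preexisting-demo-source \
    --id crr-preexisting-demo \
    --inventory-configuration '{
        "Id": "crr-preexisting-demo",
        "IsEnabled": true,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["ReplicationStatus"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::crr-preexisting-demo-inventory",
                "Format": "CSV"
            }
        }
    }'
```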
After enabling S3 inventory, you need to wait for the inventory files to be delivered. It can take up to 48 hours to deliver the first report. If you’re following the demo, ensure that the inventory report is delivered before proceeding.
Here’s what our example inventory file looks like:
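Illustrative rows (the exact columns depend on the fields you selected; here we assume bucket, key, version ID, is-latest, delete-marker, and replication status, with <version-id> standing in for the real values):

```
"crr-preexisting-demo-source","preexisting-1.txt","<version-id>","true","false",""
"crr-preexisting-demo-source","failed-1.txt","<version-id>","true","false","FAILED"
```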
You can also check the objects’ Overview tab on the S3 console. The pre-existing objects do not have a replication status, but the failed objects show a replication status of FAILED.
Register the table in the AWS Glue Data Catalog using Amazon Athena
To query the inventory files using SQL, you first need to create an external table in the AWS Glue Data Catalog. Open the Amazon Athena console at https://console.aws.amazon.com/athena/home.
On the Query Editor tab, run the following SQL statement. This statement registers the external table in the AWS Glue Data Catalog.
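A table definition along the following lines works for a CSV inventory that includes the replication status (the table name is an assumption, and the LOCATION must point to the hive directory under your inventory destination prefix):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS crr_preexisting_demo (
    `bucket` string,
    key string,
    version_id string,
    is_latest boolean,
    is_delete_marker boolean,
    replication_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://crr-preexisting-demo-inventory/crr-preexisting-demo-source/crr-preexisting-demo/hive/';
```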
After creating the table, you need to make the AWS Glue Data Catalog aware of any existing data and partitions by adding partition metadata to the table. To do this, you use the Metastore Consistency Check utility to scan for and add partition metadata to the AWS Glue Data Catalog.
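In Athena, that utility is the MSCK REPAIR TABLE statement:

```sql
MSCK REPAIR TABLE crr_preexisting_demo;
```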
To learn more about why this is required, see the documentation on MSCK REPAIR TABLE and data partitioning in the Amazon Athena User Guide.
Now that the table and partitions are registered in the Data Catalog, you can query the inventory files with Amazon Athena.
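For example, the following query returns the inventory rows for a single delivery date (the dt value is a placeholder for your partition):

```sql
SELECT bucket, key, version_id, is_latest, replication_status
FROM crr_preexisting_demo
WHERE dt = '2019-02-24-04-00';
```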
The query returns all rows in the S3 inventory for a specific delivery date: the pre-existing objects have an empty replication status, and the failed-*.txt objects show FAILED. You’re now ready to launch an EMR cluster to copy in place the pre-existing and failed objects.
Note: If your goal is to fix FAILED objects, make sure that you correct what caused the failure (IAM permissions or S3 bucket policies) before proceeding to the next step.
Create an EMR cluster to copy objects
To parallelize the copy-in-place operations, run a Spark job on Amazon EMR. To facilitate EMR cluster creation and EMR step submission, we wrote a bash script (available in this GitHub repository).
To run the script, clone the GitHub repo. Then launch the EMR cluster as follows:
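The steps look roughly like the following (the repository organization, script name, and arguments are assumptions; check the repo’s README for the actual invocation):

```bash
git clone https://github.com/aws-samples/amazon-s3-crr-preexisting-objects.git
cd amazon-s3-crr-preexisting-objects
./create_emr_cluster.sh   # hypothetical script name; see the README
```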
Note: Running the bash script results in AWS charges. By default, it creates two Amazon EC2 instances: one m4.xlarge and one m4.2xlarge. Auto-termination is enabled, so the cluster terminates when it finishes the in-place copies.
The script performs the following tasks:
- Creates the default EMR roles (EMR_EC2_DefaultRole and EMR_DefaultRole).
- Uploads the files used for bootstrap actions and steps to Amazon S3 (we use crr-preexisting-demo-inventory to store these files).
- Creates an EMR cluster with Apache Spark installed using the create-cluster command.
After the cluster is provisioned:
- A bootstrap action installs boto3 and awscli.
- Two steps execute, copying the Spark application to the master node and then running the application.
The following are highlights from the Spark application. You can find the complete code for this example in the amazon-s3-crr-preexisting-objects repo on GitHub.
Here we select records from the table registered with the AWS Glue Data Catalog, filtering for objects with a replication_status of FAILED or an empty value.
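A sketch of that selection, assuming the table and column names from the DDL above (the dt value is again a placeholder):

```python
from pyspark.sql import SparkSession

# Assumes the EMR cluster is configured to use the AWS Glue Data Catalog
# as its Hive metastore, so the inventory table is visible to Spark SQL.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Pre-existing objects have an empty replication status; failed objects show FAILED.
objects_to_copy = spark.sql("""
    SELECT bucket, key
    FROM crr_preexisting_demo
    WHERE dt = '2019-02-24-04-00'
      AND (replication_status = '' OR replication_status = 'FAILED')
""")
```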
We call the copy_object function for each key returned by the previous query.
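A minimal version of that function might look like this (a sketch using boto3; the actual application in the repo also preserves tags, ACLs, storage class, and encryption settings):

```python
import boto3

def copy_object(bucket, key):
    # Create the client per call for simplicity; in the Spark job you would
    # typically create one client per partition instead.
    s3 = boto3.client('s3')
    # Fetch the object's current user metadata so the copy preserves it.
    metadata = s3.head_object(Bucket=bucket, Key=key)['Metadata']
    # S3 rejects a copy onto itself unless something changes,
    # so add a marker key to the metadata (see the note below).
    metadata['forcedreplication'] = 'true'
    # Copy the object over itself, which resets its replication status
    # and triggers CRR to replicate it.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={'Bucket': bucket, 'Key': key},
        Metadata=metadata,
        MetadataDirective='REPLACE',
    )
```

In the Spark job, this function runs on the executors (for example, via foreachPartition on the query results), so many copies proceed concurrently across the cluster.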
Note: The Spark application adds a forcedreplication key to the objects’ metadata. It does this because Amazon S3 doesn’t allow you to copy in place without changing the object or its metadata.
Verify the success of the EMR job by running a query in Amazon Athena
The Spark application outputs its results to S3. You can create another external table with Amazon Athena and register it with the AWS Glue Data Catalog. You can then query the table with Athena to ensure that the copy-in-place operation was successful.
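A sketch of that verification, assuming the job writes CSV results under a results/ prefix (the table name, columns, and LOCATION are assumptions; match them to the application’s actual output):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS crr_preexisting_demo_results (
    `bucket` string,
    key string,
    copy_result string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://crr-preexisting-demo-inventory/results/';

SELECT * FROM crr_preexisting_demo_results;
```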
The results appear on the Athena console, one row per copied object.
Although this shows that the copy-in-place operation was successful, CRR still needs to replicate the objects. Subsequent inventory files show the objects’ replication status as COMPLETED. You can also verify on the console that preexisting-*.txt and failed-*.txt are COMPLETED.
Because CRR requires versioned buckets, the copy-in-place operation produces an additional version of each object. You can use S3 lifecycle policies to manage noncurrent versions.
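For example, a lifecycle rule like the following expires noncurrent versions after 30 days (a sketch; tune the retention window to your requirements):

```bash
aws s3api put-bucket-lifecycle-configuration \
    --bucket crr-preexisting-demo-source \
    --lifecycle-configuration '{
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
            }
        ]
    }'
```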
Conclusion
In this post, we showed how to use Amazon S3 inventory, Amazon Athena, the AWS Glue Data Catalog, and Amazon EMR to perform copy-in-place operations on pre-existing and failed objects at scale.
Note: Amazon S3 batch operations is an alternative way to copy objects. The difference is that S3 batch operations doesn’t check each object’s existing properties and set object ACLs, storage class, and encryption on an object-by-object basis, as the approach in this post does. For more information, see Introduction to Amazon S3 Batch Operations in the Amazon S3 Console User Guide.
About the Authors
Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.
Chauncy McCaughey is a senior data architect at AWS. His current side project is using statistical analysis of driving habits and traffic patterns to understand how he always ends up in the slow lane.