Migrate data from Google Drive to Amazon S3 using Rclone

Whether you choose to operate entirely on AWS or in multicloud and hybrid environments, one of the primary reasons to adopt AWS is the broad choice of services we offer, enabling you to innovate, build, deploy, and monitor your workloads.

Amazon S3 is a great option for Google Drive users seeking a comprehensive storage solution. Amazon S3 offers very high durability and availability, virtually unlimited scalability, and performant, cost-effective storage options. With its pay-as-you-go pricing model and storage classes optimized for various access patterns and cost requirements, Amazon S3 caters to diverse needs, from managing mission-critical data to storing data for backup and archive. To migrate data from Google Drive to S3, you can use rclone, an open-source command-line tool that provides a streamlined solution for transferring data.

In this post, we demonstrate how you can use rclone to move data from Google Drive to Amazon S3. We walk you through the process of setting up an Amazon Elastic Compute Cloud (EC2) instance with rclone installed and configured to transfer data from Google Drive to Amazon S3. The majority of this setup process is automated through AWS CloudFormation. We also explore different rclone flags to reduce data transfer time while addressing service quotas. Rclone simplifies the migration process with its native support for both storage systems, enabling synchronization and efficient data transfer. Its customizable configuration options, such as controlling concurrent transfers and setting transaction rate limits, help optimize the transfer process.

Solution overview

The following diagram illustrates the architecture for transferring data from Google Drive to Amazon S3 using an EC2 instance with rclone installed. The EC2 instance, which is provisioned through CloudFormation, acts as an intermediary to facilitate the data transfer process. Rclone, running on the EC2 instance, copies data from Google Drive to Amazon S3.

Figure 1: Architecture diagram to transfer data from Google Drive to Amazon S3 using Rclone

CloudFormation automates the deployment and configuration of the EC2 instance and the rclone setup. This ensures a consistent and reproducible environment while minimizing manual error, saving time, and reducing effort. Rclone, configured on the EC2 instance, connects to both Google Drive and Amazon S3, enabling efficient data transfer from Google Drive to Amazon S3.

Prerequisites

You need the following prerequisites to implement this solution:

Make sure you have the necessary permissions to access AWS Identity and Access Management (IAM), AWS Secrets Manager, and Amazon EC2.
You should have an Amazon Virtual Private Cloud (Amazon VPC), a subnet in that VPC, and an EC2 key pair. You select these as inputs when you set up the CloudFormation stack.
Decide between the x86 and AWS Graviton-based instances (ARM), and choose the EC2 instance type that’s offered in your subnet’s Availability Zone (AZ). Refer to the user guide for finding an EC2 instance type to filter by architecture and AZ. For best price-performance, we recommend the ARM instance unless you have a specific requirement for an x86.
Rclone suggests making your own Google Drive API Client ID instead of using the default one. Refer to the rclone documentation for creating your own Google Drive Client ID and to learn why it’s recommended.
You need an S3 bucket as a destination data store.
You should have rclone available on a machine that has a web browser. Install rclone on your local system.

Review the CloudFormation template to understand IAM user permissions and adjust as necessary. Refer to IAM policies best practices for more details. Similarly, check and update security groups for the EC2 instance if needed.

The CloudFormation template automates the rclone setup for Amazon S3, but Google Drive needs manual token authorization after you connect to the EC2 instance. We cover this later in the post.

Walkthrough

Transferring data from Google Drive to Amazon S3 in this post involves:

1. Deploying the CloudFormation template, which provisions an EC2 instance, installs rclone, and configures rclone remote connections.

2. Completing Google Drive token authorization after connecting to the EC2 instance.

3. Using rclone commands to transfer data from Google Drive to Amazon S3.

In the following sections we go through these steps in more detail.

1. CloudFormation stack deployment

This section will walk you through deploying the CloudFormation template to create necessary resources for data transfer.

1.1. Create stack

1.1.1. Download the CloudFormation template for this post, CFT-GDrive-to-S3.yaml, and then open the CloudFormation console.

1.1.2. On the Stacks page, choose Create stack at top right, and then choose With new resources (standard).

1.1.3. On the Create stack page, select Choose an existing template in the Prepare template section. In the Specify template section, select Upload a template file, then Choose file, and select the CFT-GDrive-to-S3.yaml template that you downloaded earlier.

1.1.4. Select Next.

1.2. Specify stack details

Figure 2: CloudFormation stack configuration

1.2.1. Provide a unique Stack name, as shown in the preceding figure. For example, “Rclone-GDrive-to-S3.”

1.2.2. Select the VPC and Subnet in which to create your EC2 instance.

1.2.3. Enter your preferred Instance Type. Follow the instructions given in the “Prerequisites” section for details on selecting the compatible instance type.

1.2.4. Select the EC2 key pair.

1.2.5. In Your IP Address, enter the IP address range from which you want to allow inbound traffic to your instance.

1.2.6. Enter the Client ID and Client secret that you created in the Google Drive API Console, as described in the Prerequisites section.

1.2.7. In Amazon S3 Bucket Name, enter the name of the bucket into which you want to transfer your data from Google Drive, and select Next.

1.3. Configure stack options

Leave all the options as default and scroll down. In the Capabilities section at the bottom of the page, select the check box to acknowledge that CloudFormation might create IAM resources, and then select Next.

1.4. Review and create

Review configurations, scroll down, and select Submit.
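If you prefer to follow the deployment from a terminal rather than the console, the AWS CLI can report when the stack is ready. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the same account and Region, and that the stack uses the example name from step 1.2.1. The helper name wait_for_stack is ours for illustration and is not part of the template.

```shell
# Sketch: wait for the stack to finish creating, then print its status.
# Assumes the AWS CLI is configured; wait_for_stack is an illustrative helper name.
wait_for_stack() {
  stack_name="${1:-Rclone-GDrive-to-S3}"

  # Blocks until creation completes; returns non-zero on failure or timeout.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1

  # Print the final stack status, for example CREATE_COMPLETE.
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].StackStatus' \
    --output text
}
```

For example, run wait_for_stack Rclone-GDrive-to-S3 after you select Submit, and connect to the instance once the status is reported.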

2. Google Drive authorization token

This section guides you through the process of authorizing rclone to access your Google Drive account.

2.1. Connecting to Google Drive Remote

After completing the CloudFormation deployment, connect to your EC2 instance: “Rclone Instance – GDrive to S3.”

The first step after connecting to the EC2 instance is to authorize rclone for Google Drive access. This is an additional security step along with providing Google Drive client_id and client_secret for rclone access.

Run the following command. When rclone asks whether to authenticate automatically through a web browser, enter n, because the Ubuntu-based EC2 instance is a remote, headless machine with no browser.

rclone config reconnect gdrive-remote:

Figure 3: Configuring rclone to connect to Google Drive remote

Copy the command that starts with rclone authorize “drive” as shown in Figure 4.

Figure 4: Accessing the link given by rclone to authenticate with Google Drive

Run it on your local system where rclone is set up, as shown in Figure 5.

Figure 5: Authorizing rclone to access Google Drive on a local system using the config token from EC2

2.2. Authorization with Google Drive

Your browser opens with Google asking you to Choose an account for Google Drive access. Select the account from which you want to transfer data. At the next prompt, rclone wants to access your Google Account, select Continue. Your browser then shows the message “Success! All done. Please go back to rclone.”

Now go back to your local system and copy the code that rclone prints under Paste the following into your remote machine, as shown in the following figure.

Figure 6: Getting code from local system after authorizing rclone to access Drive on browser

Access the Ubuntu rclone EC2 instance, and paste the code you copied in the config token> field as shown in Figure 7.

Rclone then asks whether you want to Configure this as a Shared Drive (team drive), as shown in Figure 7. We entered n because we did not want to configure this remote as a shared drive. Enter the choice that is appropriate to your use case.

Figure 7: Authorizing rclone on EC2 to complete Google Drive authorization

2.3. Rclone remote connections configuration

When the CloudFormation template provisioned the EC2 instance, it created an initial rclone config file containing the Amazon S3 and Google Drive configurations. The following command opens this file in an editor, as shown in the following image.

nano /home/ubuntu/.config/rclone/rclone.conf

Figure 8: Rclone configuration file

If you’d like to create or update configurations according to your requirements, then you can directly edit this configuration file by going to the file path. Alternatively, you can enter an interactive configuration session after connecting to your instance by running the rclone config command and then editing or updating from there, as shown in the following image:

Figure 9: Editing Rclone configuration

Refer to the Amazon S3 rclone configuration and Google Drive rclone configuration for detailed instructions on customizing remotes to meet your needs.

3. Transferring and managing files between Google Drive and Amazon S3

In this section, you will see how to use rclone to transfer data from Google Drive to Amazon S3 and perform various file management operations.

For this post, Google Drive and Amazon S3 remote connections are named gdrive-remote and s3-remote. You can run the following command to list the objects in the specified path.

rclone ls <remote>:<folder_name>/<subfolder_name>

rclone ls <remote>:<bucket_name>/<folder_name>

Figure 10: Listing objects within specified paths from remotes using rclone list command
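Before copying, it can help to know how much data a remote path holds. The following is a minimal sketch, assuming rclone is installed and the remotes are configured as described above. It uses the rclone lsd and rclone size commands, and the helper name summarize_remote is ours for illustration.

```shell
# Sketch: summarize what a remote path contains before copying it.
# summarize_remote is an illustrative helper name; pass any remote path.
summarize_remote() {
  remote_path="$1"

  # List the top-level directories (or the buckets, for the root of an S3 remote).
  rclone lsd "$remote_path"

  # Print the total number of objects and their combined size.
  rclone size "$remote_path"
}
```

For example, summarize_remote gdrive-remote: reports the object count and total size of the Google Drive remote, which you can compare against the destination after the transfer.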

Start by copying files from one remote to another. The rclone copy command copies files from source to destination. For test purposes, consider transferring around 1 TB of data from Google Drive to an S3 bucket.

The direct command to copy is as follows:

rclone copy <source>:<sourcepath> <dest>:<destpath>

However, you need to account for the quotas of the storage service providers while also optimizing transfer speed. Rclone flags add extra functionality to rclone commands, enabling you to manage data across remotes more efficiently. By using the appropriate flags, you can strike a balance between adhering to the quotas and minimizing the time needed for data transfers.

Start with the full command, then review how each flag works before you run the transfer.

rclone copy \
--tpslimit 200 \
--transfers 200 \
--buffer-size 200M \
--checkers 400 \
--s3-upload-cutoff 100M \
--s3-chunk-size 100M \
--s3-upload-concurrency 50 \
gdrive-remote: \
s3-remote:EXAMPLE-DESTINATION-BUCKET \
-P

The -P/--progress flag shows real-time transfer statistics during file operations. The remaining flags are described in the following sections.

Transactions per second flag

--tpslimit limits the number of transactions per second (TPS). A transaction is roughly an API call. For an HTTP-based backend, it is an HTTP request such as PUT, GET, or POST.

Transfers flag

--transfers controls the number of simultaneous file transfers. By default, rclone performs four parallel transfers.

Buffer size flag

--buffer-size=SIZE specifies the in-memory buffer size for each transfer to improve transfer speed. Each of the parallel transfers set by --transfers uses the specified amount of memory for buffering.

Checkers flag

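--checkers=N controls the number of parallel file checkers during operations such as copying files. Checkers compare files in the source and destination to determine which ones need to be transferred. By default, rclone runs eight checkers in parallel.

Rclone supports Amazon S3 multipart upload. Multipart upload enables the uploading of a single object as multiple parts, each representing a contiguous portion of the object’s data. It’s recommended to use multipart uploads instead of single operations when the object size exceeds 100 MB. In the preceding command, --s3-upload-cutoff sets the file size above which rclone switches to multipart upload, --s3-chunk-size sets the size of each part, and --s3-upload-concurrency sets how many parts of a single file are uploaded in parallel. Refer to the multipart uploads with rclone documentation for more information.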
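These flag values also determine how much memory rclone can use. The following sketch estimates upper bounds for the values in the copy command above. It relies on two points from the rclone documentation: --buffer-size is used per transfer, and multipart uploads can use up to transfers × upload concurrency × chunk size of extra memory. The variable names are ours for illustration.

```shell
# Sketch: upper-bound memory estimate for the flag values in the copy command above.
# rclone's "M" size suffix means MiB, so all sizes here are in MiB.
TRANSFERS=200
BUFFER_SIZE_MIB=200
S3_CHUNK_SIZE_MIB=100
S3_UPLOAD_CONCURRENCY=50

# --buffer-size is used once per parallel transfer.
BUFFER_TOTAL_MIB=$(( TRANSFERS * BUFFER_SIZE_MIB ))

# Multipart uploads can use transfers * upload concurrency * chunk size of extra memory.
MULTIPART_TOTAL_MIB=$(( TRANSFERS * S3_UPLOAD_CONCURRENCY * S3_CHUNK_SIZE_MIB ))

echo "Read buffers (upper bound):      ${BUFFER_TOTAL_MIB} MiB"
echo "Multipart buffers (upper bound): ${MULTIPART_TOTAL_MIB} MiB"
```

These are worst-case figures that are approached only when every transfer is a large file being uploaded in parts at the same time. With many small files, actual usage is far lower. Comparing the estimates with your instance's memory is a quick way to judge whether the flag values suit the instance type you chose.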

Google Drive quotas

Because the Google Drive API is a shared service, Google sets quotas, such as the number of queries allowed within a defined time period.

You can find more information in the Google Drive documentation on usage limits.

To check your quotas, refer to the Google Cloud documentation on viewing and managing quotas, or follow these steps:

1. Visit the Google API Console.

2. Select your project. In the Enabled APIs & Services section, select Google Drive API.

3. Under QUOTAS & SYSTEM LIMITS, you can see properties such as “Queries per minute” and “Queries per minute per user.”

In our case, both “Queries per minute” and “Queries per minute per user” are set to 12,000, which translates to 200 queries per second (12,000 / 60).

Amazon S3 quotas

Amazon S3, on the other hand, supports significantly higher request rates. It allows at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix, with no limit on the number of prefixes per bucket.

While Amazon S3 scales to a higher request rate, temporary 503 (Slow Down) errors may occur, but they subside once scaling completes. Refer to the best practices design patterns for optimizing Amazon S3 performance.

To make sure that you stay within the quota limits, consider values from both remotes and set the --tpslimit to the lower of the two. Since Google Drive allows only 200 queries per second in this case, set the flag as --tpslimit 200.
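The selection of the --tpslimit value can be sketched as a short calculation. The quota figures are the ones quoted in this post, and the variable names are ours for illustration. Substitute the quotas shown for your own project.

```shell
# Sketch: derive --tpslimit from the quota values quoted in this post.
GDRIVE_QUERIES_PER_MINUTE=12000   # "Queries per minute" from the Google API Console
S3_PUT_PER_SECOND=3500            # S3 PUT/COPY/POST/DELETE requests per second per prefix

# Convert the per-minute Google Drive quota to a per-second rate.
GDRIVE_QUERIES_PER_SECOND=$(( GDRIVE_QUERIES_PER_MINUTE / 60 ))

# Use the lower of the two per-second limits.
if [ "$GDRIVE_QUERIES_PER_SECOND" -lt "$S3_PUT_PER_SECOND" ]; then
  TPS_LIMIT=$GDRIVE_QUERIES_PER_SECOND
else
  TPS_LIMIT=$S3_PUT_PER_SECOND
fi

echo "--tpslimit ${TPS_LIMIT}"
```

With the values above, Google Drive is the tighter limit, so the script prints --tpslimit 200.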

Figure 10.1: Rclone copy command with the flags to copy the data from Google Drive to the S3 bucket

Now you can run the preceding command. The following image shows that transferring 1001.06 GB of data from Google Drive to Amazon S3 took approximately 10 minutes and 15.7 seconds.

Figure 10.2: Output of the copy command showing the details about the transfer
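The reported size and elapsed time imply an average throughput, which is useful when comparing instance types by network bandwidth. The following sketch works out that figure, taking the units as reported in the output. The variable names are ours for illustration.

```shell
# Sketch: average throughput implied by the transfer reported above.
DATA_GB=1001.06
ELAPSED_SECONDS=$(awk 'BEGIN { print 10 * 60 + 15.7 }')   # 10 minutes 15.7 seconds

# Gigabytes per second, and the equivalent in gigabits per second.
GB_PER_SECOND=$(awk -v gb="$DATA_GB" -v s="$ELAPSED_SECONDS" 'BEGIN { printf "%.2f", gb / s }')
GBIT_PER_SECOND=$(awk -v gb="$DATA_GB" -v s="$ELAPSED_SECONDS" 'BEGIN { printf "%.2f", gb * 8 / s }')

echo "Average throughput: ${GB_PER_SECOND} GB/s (${GBIT_PER_SECOND} Gbit/s)"
```

This works out to about 1.63 GB/s, or roughly 13 Gbit/s. A sustained rate in this range requires an instance type whose network bandwidth is at least that high.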

After the transfer, the rclone check command verifies that the files in the source and destination match.

rclone check gdrive-remote: s3-remote:EXAMPLE-DESTINATION-BUCKET -P

Figure 11: Rclone check command

The rclone delete command deletes the files in the given path.

rclone delete gdrive-remote:gdrive-test-folder-2 -P

Figure 12: Rclone delete command

Refer to the rclone commands documentation for a complete list of available commands, such as rclone sync if you’d like to sync the source with the destination.

Monitoring EC2 instance

Refer to the rclone documentation to learn more about these flags and when to adjust their values. The different flags discussed in this post primarily use the network bandwidth and memory of the EC2 instance. We recommend experimenting with different flag values while monitoring your instance’s performance to achieve optimal results without exceeding the throttle limits of your EC2 instance.

For example, suppose you run the preceding command to transfer 1 TB of test data and observe that your EC2 instance’s CPU usage is 30%, memory usage is 60%, and network usage is 70%. Ideally, you want to stay below 100%, with a safer target of around 90%. These metrics therefore indicate that there is headroom to increase flag values, such as --transfers, to speed up the transfer.
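The headroom in this example can be turned into a starting point for the next test run. The following is a rough sketch only. It assumes that resource usage scales linearly with --transfers, which is a simplification, and the variable names are ours for illustration.

```shell
# Sketch: rough headroom estimate; assumes usage scales linearly with --transfers.
CURRENT_TRANSFERS=200
CPU_PCT=30
MEMORY_PCT=60
NETWORK_PCT=70
TARGET_PCT=90

# Find the busiest resource, because it reaches the target first.
BUSIEST_PCT=$CPU_PCT
if [ "$MEMORY_PCT" -gt "$BUSIEST_PCT" ]; then BUSIEST_PCT=$MEMORY_PCT; fi
if [ "$NETWORK_PCT" -gt "$BUSIEST_PCT" ]; then BUSIEST_PCT=$NETWORK_PCT; fi

# Scale --transfers so that the busiest resource lands near the target.
SUGGESTED_TRANSFERS=$(( CURRENT_TRANSFERS * TARGET_PCT / BUSIEST_PCT ))

echo "Busiest resource: ${BUSIEST_PCT}%, try --transfers ${SUGGESTED_TRANSFERS}"
```

Treat the result as a value to test rather than a guarantee. The --tpslimit value still caps the request rate regardless of how many transfers run in parallel.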

You can monitor your EC2 instance using Amazon CloudWatch, an application performance monitoring service.
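CloudWatch metrics can also be read from a terminal. The following is a minimal sketch, assuming the AWS CLI is configured and GNU date is available, as it is on Ubuntu. The helper name instance_cpu_utilization is ours for illustration. Note that memory usage is not among the default EC2 metrics and requires the CloudWatch agent on the instance.

```shell
# Sketch: average CPU utilization of an instance over the last hour, in 5-minute periods.
# Assumes the AWS CLI is configured and GNU date is available (as on Ubuntu).
# instance_cpu_utilization is an illustrative helper name.
instance_cpu_utilization() {
  instance_id="$1"   # ID of the EC2 instance that runs rclone

  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$instance_id" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Average \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Average]' \
    --output text
}
```

Call the function with the ID of your instance while a transfer is running. Replacing CPUUtilization with NetworkIn or NetworkOut reports network traffic in the same way.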

Other ways to optimize the data transfer process

The following methods can help you further improve the data transfer process:

1. Higher performance compute: A direct option is to select an EC2 instance type with higher performance. Try to have a balance among the following factors: Amazon EC2 usage time, instance cost, and data transfer completion time.

2. TPS quotas: To address TPS quotas, you can request a quota increase from Google.

3. Other relevant rclone flags: You can look into the following flags and evaluate if they are relevant to your preferences. Note that a few of these flags at times may cross the quota values such as TPS, so consider them with caution:

Cleaning up

You may want to delete the resources created in this post to avoid unwanted future charges. To delete the stack resources, you can delete the CloudFormation stack. In addition, visit the Google Cloud Console credentials page and delete the OAuth credentials that you created earlier.
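The stack can also be deleted from a terminal. The following is a minimal sketch, assuming the AWS CLI is configured and the stack uses the example name from step 1.2.1. The helper name delete_migration_stack is ours for illustration.

```shell
# Sketch: delete the CloudFormation stack and wait until deletion finishes.
# Assumes the AWS CLI is configured; delete_migration_stack is an illustrative helper name.
delete_migration_stack() {
  stack_name="${1:-Rclone-GDrive-to-S3}"

  # Start the deletion of the stack and the resources it created.
  aws cloudformation delete-stack --stack-name "$stack_name" || return 1

  # Blocks until deletion completes; returns non-zero on failure or timeout.
  aws cloudformation wait stack-delete-complete --stack-name "$stack_name"
}
```

For example, run delete_migration_stack Rclone-GDrive-to-S3 once you have verified the transferred data.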

Conclusion

In this post, we explored how to efficiently transfer large amounts of data from Google Drive to Amazon S3 using Rclone. We automated the setup process of EC2 instance creation, Rclone installation, and remote connections configuration through AWS CloudFormation, reducing manual effort and potential errors. We also explored various Rclone flags that optimize transfer times while staying within service quotas, helping you avoid throttling and delays. This approach allows you to customize the data transfer process to match your specific requirements, ensuring efficient and reliable migrations even for large datasets.

By moving your data to Amazon S3, you can take advantage of cost savings, and its scalability and performance make it a great choice for your data lakes, analytical, and ML/AI applications. Additionally, Amazon S3 provides extensive security features and supports numerous compliance regulations, enhancing data protection and governance.

Feel free to check out other posts that might be helpful in migrating data and monitoring:

Thank you for reading this post. If you have any comments, feel free to post them in the comments section.

AWS Storage Blog