AWS Storage Blog

Migrate data from Google Drive to Amazon S3 using Rclone

Whether you choose to operate entirely on AWS or in multicloud and hybrid environments, one of the primary reasons to adopt AWS is the broad choice of services we offer, enabling you to innovate, build, deploy, and monitor your workloads.

Amazon S3 is a great option for Google Drive users seeking a comprehensive storage solution. Amazon S3 offers very high durability and availability, virtually unlimited scalability, and performant, cost-effective storage options. With its pay-as-you-go pricing model and storage classes optimized for various access patterns and cost requirements, Amazon S3 caters to diverse needs, from managing mission-critical data to storing data for backup and archive. To migrate data from Google Drive to S3, you can use rclone, an open-source command-line tool that provides a streamlined solution for transferring data.

In this post, we demonstrate how you can use rclone to move data from Google Drive to Amazon S3. We walk you through the process of setting up an Amazon Elastic Compute Cloud (EC2) instance with rclone installed and configured to transfer data from Google Drive to Amazon S3. The majority of this setup process is automated through AWS CloudFormation. We also explore different rclone flags to reduce data transfer time while addressing service quotas. Rclone simplifies the migration process with its native support for both storage systems, enabling synchronization and efficient data transfer. Its customizable configuration options, such as controlling concurrent transfers and setting transaction rate limits, help optimize the transfer process.

Solution overview

The following diagram illustrates the architecture for transferring data from Google Drive to Amazon S3 using an EC2 instance with rclone installed. The EC2 instance, which is provisioned through CloudFormation, acts as an intermediary to facilitate the data transfer process. Rclone, running on the EC2 instance, copies data from Google Drive to Amazon S3.

Figure 1: Architecture diagram to transfer data from Google Drive to Amazon S3 using Rclone

CloudFormation automates the deployment and configuration of the EC2 instance and the rclone setup. This gives you a consistent, reproducible environment while reducing manual errors, time, and effort. Rclone, configured on the EC2 instance, connects to both Google Drive and Amazon S3 and transfers the data between them.

Prerequisites

You need the following prerequisites to implement this solution:

Review the CloudFormation template to understand IAM user permissions and adjust as necessary. Refer to IAM policies best practices for more details. Similarly, check and update security groups for the EC2 instance if needed.

The CloudFormation template automates the rclone setup for Amazon S3, but Google Drive needs manual token authorization after you connect to the EC2 instance. We cover this later in the post.

Note that the sample solution in this post is designed to work in the US East (N. Virginia) Region (us-east-1). If you’d like to deploy this solution in a different AWS Region, refer to the instructions in the “Deploying in a different AWS Region” section of this post.

Walkthrough

Transferring data from Google Drive to Amazon S3 in this post involves:

1. Deploying the CloudFormation template, which provisions an EC2 instance, installs rclone, and configures the rclone remote connections.

2. Completing Google Drive token authorization after connecting to the EC2 instance.

3. Using rclone commands to transfer data from Google Drive to Amazon S3.

In the following sections we go through these steps in more detail.

1. CloudFormation stack deployment

This section will walk you through deploying the CloudFormation template to create necessary resources for data transfer.

1.1. Create stack

1.1.1. Download the CloudFormation template, Final-CFT-GDrive-S3-Review.yaml, designed for this blog solution here and then visit the CloudFormation console.

1.1.2. On the Stacks page, choose Create stack at top right, and then choose With new resources (standard).

1.1.3. On the Create stack page, in the Prepare template section, select Choose an existing template. In the Specify template section, select Upload a template file > Choose file, and then select the CloudFormation template, Final-CFT-GDrive-S3-Review.yaml, that you downloaded earlier.

1.1.4. Select Next.
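Before uploading the template, it can help to list the parameters it declares so that you know which values the next step asks for. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials and that the template file is in the current directory. The function name is illustrative and not part of the solution.

```shell
# Print the parameter keys and descriptions declared in a CloudFormation template.
# Assumption: the AWS CLI is installed and credentials are configured.
show_template_parameters() {
  aws cloudformation validate-template \
    --template-body "file://$1" \
    --query 'Parameters[].[ParameterKey,Description]' \
    --output table
}

# Example (not run here):
#   show_template_parameters Final-CFT-GDrive-S3-Review.yaml
```

The same call also fails early if the template has a syntax error, which is cheaper to discover here than after submitting the stack.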

1.2. Specify stack details

Figure 2: CloudFormation stack configuration

1.2.1. Provide a unique Stack name, as shown in the preceding figure. For example, “Rclone-GDrive-to-S3.”

1.2.2. Select the VPC and Subnet in which to create your EC2 instance.

1.2.3. Enter your preferred Instance Type. Follow the instructions given in the Prerequisites section for details on selecting the compatible instance type.

1.2.4. Select the EC2 key pair.

1.2.5. In Your IP Address, enter the IP address range from which you want to allow inbound traffic to your instance.

1.2.6. Enter the Client ID and Client secret that you created from the Google Drive API Console in the previous steps.

1.3. Configure stack options

You can leave the options as default and select Next.

1.4. Review and create

In the Capabilities section at the bottom of the page, select the check box to acknowledge that CloudFormation might create IAM resources, and then select Submit.

2. Google Drive authorization token

This section guides you through the process of authorizing rclone to access your Google Drive account.

2.1. Connecting to Google Drive Remote

After completing the CloudFormation deployment, connect to your EC2 instance: “Rclone Instance – GDrive to S3.”

The first step after connecting to the EC2 instance is to authorize rclone for Google Drive access. This is an additional security step along with providing Google Drive client_id and client_secret for rclone access.

Run the following command and enter n when prompted, because you are working on a remote or headless machine: the Ubuntu-based EC2 instance doesn’t have a browser.

rclone config reconnect gdrive-remote:

Figure 3: Configuring rclone to connect to Google Drive remote

After entering n, as shown in the following image, rclone provides a link to get the token from Google for rclone. Copy the link and paste it in a browser. Make sure you are logged into the Google account associated with the Google Drive you want to access.

Figure 4: Accessing the link given by rclone to authenticate with Google Drive

2.2. Authorization with Google Drive

After following a series of prompts from Google, you should see a screen like the following image, which contains the authorization code. The prompts appear in this order: Choose an account > rclone wants to access your Google Account > Authorization code.

Figure 5: Authorization code from Google

Copy the authorization code, go back to the rclone EC2 instance, and paste it in the Enter verification code> field.

Rclone then asks whether you want to Configure this as a team drive, as shown in the following image. Enter the choice that is appropriate to your use case.

Figure 6: Providing the authorization code to rclone from Google
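Once the token is stored, a quick read-only check can confirm that the authorization worked before you start a large transfer. The sketch below assumes the remote is named gdrive-remote, as in this post. The function name is illustrative.

```shell
# Confirm that rclone can read from the Google Drive remote.
# Both commands are read-only: lsd lists top-level folders, about reports quota usage.
verify_gdrive_remote() {
  remote="${1:-gdrive-remote}"
  rclone lsd "${remote}:" && rclone about "${remote}:"
}

# Example (not run here):
#   verify_gdrive_remote gdrive-remote
```

If either command returns an authorization error, rerunning rclone config reconnect gdrive-remote: is the usual next step.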

2.3. Rclone remote connections configuration

When the CloudFormation template provisioned the EC2 instance, it created an initial rclone configuration file containing the Amazon S3 and Google Drive remotes. Running the following command opens this configuration file, as shown in the following image.

nano /home/ubuntu/.config/rclone/rclone.conf

Figure 7: Rclone configuration file

If you’d like to create or update configurations according to your requirements, then you can directly edit this configuration file by going to the file path. Alternatively, you can enter an interactive configuration session after connecting to your instance by running the rclone config command and then editing or updating from there, as shown in the following image:

Figure 8: Editing Rclone configuration

Refer to the Amazon S3 rclone configuration and Google Drive rclone configuration for detailed instructions on customizing remotes to meet your needs.
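To confirm which configuration file rclone is actually reading and which remotes it defines, two read-only subcommands are enough. This is a small sketch. The function name is illustrative, and the output on your instance depends on what the template created.

```shell
# Show the active rclone config file path and the names of the configured remotes.
show_rclone_setup() {
  rclone config file   # prints the path of the configuration file in use
  rclone listremotes   # prints each remote name followed by a colon
}

# Example (not run here):
#   show_rclone_setup
```

For the setup in this post, the list of remotes would be expected to include gdrive-remote: and s3-remote:.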

3. Transferring and managing files between Google Drive and Amazon S3

In this section, you will see how to use rclone to transfer data from Google Drive to Amazon S3 and perform various file management operations.

For this post, the Google Drive and Amazon S3 remote connections are named gdrive-remote and s3-remote. You can run the following commands to list the objects in a specified path on either remote.

rclone ls <remote>:<folder_name>/<subfolder_name>
rclone ls <remote>:<bucket_name>/<folder_name>

Figure 9: Listing objects within specified paths from remotes using rclone list command

Start by copying files from one remote to another. The rclone copy command copies files from source to destination. For test purposes, consider transferring around 1 TB of data from Google Drive to an S3 bucket.

The direct command to copy is as follows:

rclone copy <source>:<sourcepath> <dest>:<destpath>

However, you need to account for the quotas set by the storage service providers while also optimizing transfer speed. Rclone flags add extra functionality to rclone commands, enabling you to manage data across remotes more efficiently. By using the appropriate flags, you can strike a balance between adhering to the quotas and minimizing the time needed for data transfers.

Start with the following command, then review how each flag works before you use it to transfer data.

rclone copy \
--tpslimit 200 \
--transfers 200 \
--buffer-size 200M \
--checkers 400 \
--s3-upload-cutoff 100M \
--s3-chunk-size 100M \
--s3-upload-concurrency 50 \
gdrive-remote: \
s3-remote:EXAMPLE-DESTINATION-BUCKET \
-P

The -P/--progress flag shows real-time transfer statistics during file operations. The remaining flags work as follows:

--tpslimit limits the number of transactions per second (TPS). For an HTTP backend, a transaction (or query) is a request such as a PUT, GET, or POST.

--transfers controls the number of simultaneous file transfers. By default, rclone performs four parallel transfers.

--buffer-size=SIZE specifies the buffer size for each transfer to improve transfer speed. Each of the --transfers uses the specified amount of memory for buffering.

--checkers=N controls the number of parallel file checkers during operations such as copying files. Checkers compare files at the source and destination to determine whether they need to be transferred. By default, rclone runs eight checkers.

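  • Rclone supports Amazon S3 multipart upload, which uploads a single object as multiple parts, each representing a contiguous portion of the object’s data. It’s recommended to use multipart uploads instead of single operations when the object size exceeds 100 MB. In the preceding command, --s3-upload-cutoff sets the file size above which rclone switches to multipart upload, --s3-chunk-size sets the size of each part, and --s3-upload-concurrency sets how many parts of the same file are uploaded concurrently. Refer to the multipart uploads with rclone documentation for more information.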
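Because several of these flags multiply together, a rough memory estimate is useful when choosing an instance type. The following sketch uses the flag values from the command in this post. It assumes that each transfer holds one --buffer-size buffer and that, per our reading of the rclone S3 documentation, multipart uploads can use up to --transfers × --s3-upload-concurrency × --s3-chunk-size of additional memory. Treat both figures as worst-case upper bounds, not measurements.

```shell
# Rough upper-bound memory estimate for the flag values used in this post.
# Assumption: rclone's M suffix is treated as MiB.
TRANSFERS=200        # --transfers
BUFFER_MIB=200       # --buffer-size 200M
S3_CONCURRENCY=50    # --s3-upload-concurrency
S3_CHUNK_MIB=100     # --s3-chunk-size 100M

# One read buffer per concurrent transfer.
BUFFER_TOTAL_MIB=$(( TRANSFERS * BUFFER_MIB ))

# Worst case: every transfer is a multipart upload with all parts in flight.
MULTIPART_TOTAL_MIB=$(( TRANSFERS * S3_CONCURRENCY * S3_CHUNK_MIB ))

echo "Read buffers:      ${BUFFER_TOTAL_MIB} MiB (~$(( BUFFER_TOTAL_MIB / 1024 )) GiB)"
echo "Multipart buffers: ${MULTIPART_TOTAL_MIB} MiB (~$(( MULTIPART_TOTAL_MIB / 1024 )) GiB)"
```

With these values the estimate comes to 40000 MiB (about 39 GiB) for read buffers and 1000000 MiB (about 976 GiB) for multipart buffers in the worst case. Actual usage is typically far lower, because files below the upload cutoff are not sent as multipart uploads, but the arithmetic shows why lowering --s3-upload-concurrency or --transfers is the first thing to try if the instance runs short of memory.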

Google Drive quotas

Because the Google Drive API is a shared service with multiple users, Google sets certain quotas, such as the number of transactions (queries) allowed in a defined time period.

You can find more info in the documentation on usage limits from Google Drive.

To check your quotas, refer to the Google Cloud view and manage quotas documentation, or follow these steps:

1. Visit the Google API Console.

2. Select your project. In the Enabled APIs & Services section, select Google Drive API.

3. Under QUOTAS & SYSTEM LIMITS, you can see properties such as “Queries per minute” and “Queries per minute per user.”

In our case, we have “Queries per minute” and “Queries per minute per user” at 12,000, which roughly translates to 200 per second.

Amazon S3 quotas

Amazon S3, on the other hand, supports significantly higher request rates. It allows at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix, with no limit on the number of prefixes per bucket.

During scaling, temporary 503 errors (slow down) may occur, but they subside once scaling completes. Refer to the best practices design patterns for optimizing Amazon S3 performance.

To make sure that you stay within the quota limits, consider values from both remotes and set the --tpslimit to the lower of the two. Since Google Drive allows only 200 queries per second in this case, set the flag as --tpslimit 200.
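The arithmetic behind that choice can be sketched in a few lines of shell. The quota values are the ones quoted in this post, and the helper names are illustrative. Substitute the numbers from your own Google API Console.

```shell
# Derive a --tpslimit value from the quotas of both remotes.
# Convert a per-minute quota to a per-second rate (integer division rounds down).
per_second() { echo $(( $1 / 60 )); }

# Return the lower of two rates.
min_rate() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

GDRIVE_QUERIES_PER_MINUTE=12000   # "Queries per minute" from the Google API Console
S3_PUT_PER_SECOND=3500            # Amazon S3 PUT/COPY/POST/DELETE rate per prefix

TPSLIMIT=$(min_rate "$(per_second "$GDRIVE_QUERIES_PER_MINUTE")" "$S3_PUT_PER_SECOND")
echo "--tpslimit ${TPSLIMIT}"
```

Rounding down keeps the result at or below the quota. You may want to leave extra headroom if other applications share the same Google project.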

Now you can implement the preceding commands. The following image shows that transferring 1001.06 GB of data from Google Drive to Amazon S3 took approximately 10 minutes 15.7 seconds.

Figure 10.1: Rclone copy command with the flags to copy the data from Google Drive to the S3 bucket

Figure 10.2: Output of the copy command showing the details about the transfer

check: Verifies that the files in the source and destination are identical.

rclone check gdrive-remote: s3-remote:EXAMPLE-DESTINATION-BUCKET -P

Figure 11. Rclone check command

delete: Deletes files from the given path.

rclone delete gdrive-remote:gdrive-test-folder-2 -P

Figure 12. Rclone delete command
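Because delete is destructive, previewing it first is a sensible precaution. The sketch below wraps the command with rclone's --dry-run flag, which reports what would be deleted without changing anything. The function name is illustrative.

```shell
# Preview which files "rclone delete" would remove, without deleting anything.
preview_delete() {
  rclone delete --dry-run "$1"
}

# Example (not run here):
#   preview_delete gdrive-remote:gdrive-test-folder-2
```

Once the preview lists only the files you expect, run the same command without --dry-run.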

Refer to the rclone commands documentation for a complete list of available commands, such as rclone sync if you’d like to sync the source with the destination.

Monitoring EC2 instance

Refer to the rclone documentation to learn more about these flags and when to adjust their values. The different flags discussed in this post primarily use the network bandwidth and memory of the EC2 instance. We recommend experimenting with different flag values while monitoring your instance’s performance to achieve optimal results without exceeding the throttle limits of your EC2 instance.

For example, suppose you run the preceding command to transfer 1 TB of test data and notice that your EC2 instance’s CPU usage is 30%, memory usage is 60%, and network usage is 70%. Ideally, you want to stay below 100% with some margin, for example at around 90%. These metrics therefore indicate that there is room to increase the values of flags such as --transfers to speed up the transfer.

You can monitor your EC2 instance using Amazon CloudWatch, an application performance monitoring service.
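As one way to pull these numbers from the command line, the following sketch queries the CPUUtilization metric that Amazon EC2 publishes to CloudWatch by default. It assumes the AWS CLI is configured. The function name, instance ID, and timestamps are placeholders. Note that memory usage is not a default EC2 metric and requires the CloudWatch agent on the instance.

```shell
# Fetch average and maximum CPU utilization for an instance in 5-minute periods.
# Arguments: instance ID, start time, end time (ISO 8601, UTC).
get_cpu_utilization() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Average Maximum \
    --output table
}

# Example with placeholder values (not run here):
#   get_cpu_utilization i-0123456789abcdef0 2025-01-01T00:00:00Z 2025-01-01T01:00:00Z
```

Swapping the metric name for NetworkIn or NetworkOut gives the corresponding network figures for the same time window.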

Other ways to optimize the data transfer process

The following methods can help you further improve the data transfer process:

1. Higher performance compute: A direct option is to select an EC2 instance type with higher performance. Try to have a balance among the following factors: Amazon EC2 usage time, instance cost, and data transfer completion time.

2. TPS quotas: If the Google Drive quota is the bottleneck, you can request a quota increase from Google.

3. Other relevant rclone flags: You can look into the following flags and evaluate if they are relevant to your preferences. Note that a few of these flags at times may cross the quota values such as TPS, so consider them with caution:

Deploying in a different AWS Region

To deploy the solution in an AWS Region other than N. Virginia, modify it as follows:

  • Replace the ImageId value in the CloudFormation template with the Amazon Machine Image (AMI) ID of “Ubuntu Server 22.04 LTS (HVM), SSD Volume Type” for your preferred architecture and AWS Region. Refer to the finding a Linux AMI documentation to look for these details.
  • Replace the region value in the CloudFormation template with your desired AWS Region code.
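One possible way to look up the AMI ID without browsing the console is through the public Systems Manager parameters that Canonical publishes for Ubuntu images. The sketch below assumes the parameter path shown is still current, which you should verify against Canonical's documentation, and that the AWS CLI is configured. The function name is illustrative.

```shell
# Look up the latest Ubuntu Server 22.04 LTS AMI ID for a Region and architecture.
# Assumption: Canonical's public SSM parameter path below is still valid.
find_ubuntu_2204_ami() {
  region="$1"
  arch="${2:-amd64}"   # amd64 or arm64
  aws ssm get-parameters \
    --region "$region" \
    --names "/aws/service/canonical/ubuntu/server/22.04/stable/current/${arch}/hvm/ebs-gp2/ami-id" \
    --query 'Parameters[0].Value' \
    --output text
}

# Example (not run here):
#   find_ubuntu_2204_ami eu-west-1 amd64
```

The value returned can then be used for ImageId in the template, alongside the matching Region code.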

Cleaning up

You may want to delete the resources created in this post to avoid unwanted future charges. To delete the stack resources, you can delete the CloudFormation stack. In addition, visit the Google Cloud Console credentials page and delete the OAuth credentials that you created earlier.
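If you prefer the command line, the stack can also be deleted with the AWS CLI. This is a minimal sketch, assuming the CLI is configured for the Region where you deployed. The stack name in the example is the one suggested earlier in this post, and the function name is illustrative.

```shell
# Delete a CloudFormation stack and wait until the deletion completes.
delete_migration_stack() {
  aws cloudformation delete-stack --stack-name "$1" &&
    aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Example (not run here):
#   delete_migration_stack Rclone-GDrive-to-S3
```

Deleting the stack does not remove the data you copied to your destination S3 bucket unless that bucket was itself created by the stack.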

Conclusion

In this post, we explored how to efficiently transfer large amounts of data from Google Drive to Amazon S3 using Rclone. We automated the setup process of EC2 instance creation, Rclone installation, and remote connections configuration through AWS CloudFormation, reducing manual effort and potential errors. We also explored various Rclone flags that optimize transfer times while staying within service quotas, helping you avoid throttling and delays. This approach allows you to customize the data transfer process to match your specific requirements, ensuring efficient and reliable migrations even for large datasets.

By moving your data to Amazon S3, you can take advantage of cost savings, and its scalability and performance make it a great choice for your data lakes, analytical, and ML/AI applications. Additionally, Amazon S3 provides extensive security features and supports numerous compliance regulations, enhancing data protection and governance.

Feel free to check out other posts that might be helpful in migrating data and monitoring:

Thank you for reading this post. If you have any comments, feel free to post them in the comments section.

Abhijeet Lokhande

Abhijeet is a Senior Solutions Architect working with the AWS Worldwide Public Sector team. He collaborates closely with Higher Education users, helping them align their mission objectives and technology strategies with the capabilities of the AWS Cloud. Additionally, Abhijeet specializes in security and compliance, assisting AWS users in overcoming challenges and providing guidance on security and compliance practices within the AWS ecosystem.

Abhishek Reddy Isireddy

Abhishek is a Solutions Architect at AWS in the Worldwide Public Sector Organization. He works with multiple public sector users, such as Educational Institutions and State and Local Governments, helping them accelerate their Cloud journey. Abhishek is interested in Artificial Intelligence, Storage, Networking, and Databases. In his free time, he likes watching movies, playing badminton, hiking, and walking.