AWS Storage Blog
Migrate data from Google Drive to Amazon S3 using Rclone
Whether you choose to operate entirely on AWS or in multicloud and hybrid environments, one of the primary reasons to adopt AWS is the broad choice of services we offer, enabling you to innovate, build, deploy, and monitor your workloads.
Amazon S3 is a great option for Google Drive users seeking a comprehensive storage solution. Amazon S3 offers very high durability and availability, virtually unlimited scalability, and performant, cost-effective storage options. With its pay-as-you-go pricing model and storage classes optimized for various access patterns and cost requirements, Amazon S3 caters to diverse needs, from managing mission-critical data to storing data for backup and archive. To migrate data from Google Drive to S3, you can use rclone, an open-source command-line tool that provides a streamlined solution for transferring data.
In this post, we demonstrate how you can use rclone to move data from Google Drive to Amazon S3. We walk you through the process of setting up an Amazon Elastic Compute Cloud (EC2) instance with rclone installed and configured to transfer data from Google Drive to Amazon S3. The majority of this setup process is automated through AWS CloudFormation. We also explore different rclone flags to reduce data transfer time while addressing service quotas. Rclone simplifies the migration process with its native support for both storage systems, enabling synchronization and efficient data transfer. Its customizable configuration options, such as controlling concurrent transfers and setting transaction rate limits, help optimize the transfer process.
Solution overview
The following diagram illustrates the architecture for transferring data from Google Drive to Amazon S3 using an EC2 instance with rclone installed. The EC2 instance, which is provisioned through CloudFormation, acts as an intermediary to facilitate the data transfer process. Rclone, running on the EC2 instance, copies data from Google Drive to Amazon S3.
Figure 1: Architecture diagram to transfer data from Google Drive to Amazon S3 using Rclone
CloudFormation automates the deployment and configuration of the EC2 instance and the rclone setup. This provides a consistent, reproducible environment while reducing manual errors, time, and effort. Rclone, configured on the EC2 instance, connects to both Google Drive and Amazon S3 to transfer data from Google Drive to S3.
Prerequisites
You need the following prerequisites to implement this solution:
- Make sure you have the necessary permissions to access AWS Identity and Access Management (IAM), AWS Secrets Manager, and Amazon EC2.
- You should have an Amazon Virtual Private Cloud (Amazon VPC), a subnet in that VPC, and an EC2 key pair. You select these as inputs when you set up the CloudFormation stack.
- Decide between the x86 and AWS Graviton-based instances (ARM), and choose the EC2 instance type that’s offered in your subnet’s Availability Zone (AZ). Refer to the user guide for finding an EC2 instance type to filter by architecture and AZ. For best price-performance, we recommend the ARM instance unless you have a specific requirement for an x86.
- Rclone suggests making your own Google Drive API Client ID instead of using the default one. Refer to the rclone documentation for creating your own Google Drive Client ID and to learn why it’s recommended.
- You need an S3 bucket as a destination data store.
- You should have rclone available on a machine that has a web browser. Install rclone on your local system.
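As a rough sketch of the instance-type prerequisite, the following AWS CLI calls list the instance types offered in an Availability Zone and the instance types for a given architecture. The function names and the example AZ are our own placeholders, and the sketch assumes the AWS CLI is installed and configured with a default Region.

```shell
# list_offered_types AZ: instance types offered in one Availability Zone.
list_offered_types() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=location,Values=$1" \
    --query 'InstanceTypeOfferings[].InstanceType' \
    --output text
}

# list_types_by_arch ARCH: instance types for an architecture (arm64 or x86_64).
list_types_by_arch() {
  aws ec2 describe-instance-types \
    --filters "Name=processor-info.supported-architecture,Values=$1" \
    --query 'InstanceTypes[].InstanceType' \
    --output text
}

# Example usage (placeholder AZ):
#   list_offered_types us-east-1a
#   list_types_by_arch arm64
```

An instance type that appears in both outputs is a candidate for the Instance Type parameter of the stack.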
Review the CloudFormation template to understand IAM user permissions and adjust as necessary. Refer to IAM policies best practices for more details. Similarly, check and update security groups for the EC2 instance if needed.
The CloudFormation template automates the rclone setup for Amazon S3, but Google Drive needs manual token authorization after you connect to the EC2 instance. We cover this later in the post.
Walkthrough
Transferring data from Google Drive to Amazon S3 in this post involves:
1. Deploying the CloudFormation template, which provisions an EC2 instance, installs rclone, and configures rclone remote connections.
2. Completing Google Drive token authorization after connecting to the EC2 instance.
3. Using rclone commands to transfer data from Google Drive to Amazon S3.
In the following sections we go through these steps in more detail.
1. CloudFormation stack deployment
This section will walk you through deploying the CloudFormation template to create necessary resources for data transfer.
1.1. Create stack
1.1.1. Download the CloudFormation template, CFT-GDrive-to-S3.yaml, designed for this blog solution here, and then visit the CloudFormation console.
1.1.2. On the Stacks page, choose Create stack at top right, and then choose With new resources (standard).
1.1.3. On the Create stack page, select Choose an existing template in the Prepare template section. In the Specify template section, select Upload a template file > Choose file, and then select the CloudFormation template, CFT-GDrive-to-S3.yaml, that you downloaded earlier.
1.1.4. Select Next.
1.2. Specify stack details
Figure 2: CloudFormation stack configuration
1.2.1. Provide a unique Stack name, as shown in the preceding figure. For example, “Rclone-GDrive-to-S3.”
1.2.2. Select the VPC and Subnet in which to create your EC2 instance.
1.2.3. Enter your preferred Instance Type. Follow the instructions given in the “Prerequisites” section for details on selecting the compatible instance type.
1.2.4. Select the EC2 key pair.
1.2.5. In Your IP Address, enter the IP address range from which you want to allow inbound traffic to your instance.
1.2.6. Enter the Client ID and Client secret that you created from the Google Drive API Console in the previous steps.
1.2.7. Enter the Amazon S3 Bucket Name of the bucket that you would like to transfer your Google Drive data into, and select Next.
1.3. Configure stack options
Leave all the options at their defaults and scroll down. In the Capabilities section at the bottom of the page, select the check box to acknowledge that the stack creates IAM resources, and then select Submit.
1.4. Review and create
Review configurations, scroll down, and select Submit.
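If you prefer the command line to the console, the same deployment can be sketched with the AWS CLI. This is a minimal sketch, not part of the original walkthrough: the function name and the params.json file are hypothetical, and the parameter keys in that file must match the ones defined in the template you downloaded. Both IAM capabilities are passed because we do not know whether the template names its IAM resources.

```shell
# deploy_stack STACK_NAME TEMPLATE_FILE PARAMS_FILE
# Creates the stack and waits until creation completes.
deploy_stack() {
  aws cloudformation create-stack \
    --stack-name "$1" \
    --template-body "file://$2" \
    --parameters "file://$3" \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM &&
  aws cloudformation wait stack-create-complete --stack-name "$1"
}

# Example usage (params.json is a hypothetical file that you create):
#   deploy_stack Rclone-GDrive-to-S3 CFT-GDrive-to-S3.yaml params.json
```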
2. Google Drive authorization token
This section guides you through the process of authorizing rclone to access your Google Drive account.
2.1. Connecting to Google Drive Remote
After completing the CloudFormation deployment, connect to your EC2 instance: “Rclone Instance – GDrive to S3.”
The first step after connecting to the EC2 instance is to authorize rclone for Google Drive access. This is an additional security step on top of providing the Google Drive client_id and client_secret for rclone access.
Run the following command and, when prompted, enter n because you are working on a remote, headless machine: the Ubuntu-based EC2 instance has no web browser.
rclone config reconnect gdrive-remote:
Figure 3: Configuring rclone to connect to Google Drive remote
Copy the command that starts with rclone authorize "drive", as shown in Figure 4.
Figure 4: Accessing the link given by rclone to authenticate with Google Drive
Run it on your local system where rclone is set up, as shown in Figure 5.
Figure 5. Authorizing rclone to access Google Drive on a local system using the config token from EC2
2.2. Authorization with Google Drive
Your browser opens with Google asking you to Choose an account for Google Drive access. Select the account from which you would like to transfer data. At the next prompt, rclone wants to access your Google Account, select Continue. Your browser then shows the message “Success! All done. Please go back to rclone.”
Now go back to your local system and copy the code that rclone provides under Paste the following into your remote machine, as shown in the following figure.
Figure 6: Getting code from local system after authorizing rclone to access Drive on browser
Access the Ubuntu rclone EC2 instance, and paste the code you copied into the config token> field, as shown in Figure 7.
Rclone then asks whether you want to Configure this as a Shared Drive (Team Drive), as shown in the figure. We entered n because we did not want to configure this remote as a shared drive. Enter the choice that is appropriate for your use case.
Figure 7: Authorizing rclone on EC2 to complete Google Drive authorization
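To confirm that the authorization worked, one option is to list the top-level folders of the remote and query its storage usage. The following is a small sketch that assumes the remote is named gdrive-remote, as in this post. The function name is our own.

```shell
# verify_gdrive_remote REMOTE_NAME
# Lists top-level directories, then shows quota and usage for the remote.
verify_gdrive_remote() {
  rclone lsd "$1:" && rclone about "$1:"
}

# Example usage:
#   verify_gdrive_remote gdrive-remote
```

If the token is missing or invalid, rclone reports an error instead of a listing, in which case you can rerun rclone config reconnect for the remote.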
2.3. Rclone remote connections configuration
When the CloudFormation template provisioned the EC2 instance, it created an initial rclone config file with the Amazon S3 and Google Drive configurations. Running the following command opens this initial configuration file, as shown in the following image.
nano /home/ubuntu/.config/rclone/rclone.conf
Figure 8: Rclone configuration file
If you’d like to create or update configurations according to your requirements, then you can directly edit this configuration file at that file path. Alternatively, you can enter an interactive configuration session after connecting to your instance by running the rclone config command and then editing or updating from there, as shown in the following image:
Figure 9: Editing Rclone configuration
Refer to the Amazon S3 rclone configuration and Google Drive rclone configuration for detailed instructions on customizing remotes to meet your needs.
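For orientation, an rclone config file with these two remotes might look roughly like the illustrative file written below. This is an assumption about the general layout, not a copy of what the template generates: the values are placeholders, the region is an example, and the real Google Drive section also holds a token after authorization. The has_remote helper is our own and only checks that a section header is present.

```shell
# Write an illustrative config to a temporary file (placeholder values only).
conf="$(mktemp)"
cat > "$conf" <<'EOF'
[gdrive-remote]
type = drive
client_id = EXAMPLE-CLIENT-ID
client_secret = EXAMPLE-CLIENT-SECRET
scope = drive

[s3-remote]
type = s3
provider = AWS
env_auth = true
region = us-east-1
EOF

# has_remote FILE NAME: succeed if FILE contains a [NAME] section header.
has_remote() {
  grep -Fxq "[$2]" "$1"
}

for name in gdrive-remote s3-remote; do
  if has_remote "$conf" "$name"; then
    echo "$name: configured"
  else
    echo "$name: missing"
  fi
done
```

On the instance itself, rclone listremotes gives the same information directly from the real config file.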
3. Transferring and managing files between Google Drive and Amazon S3
In this section, you will see how to use rclone to transfer data from Google Drive to Amazon S3 and perform various file management operations.
For this post, the Google Drive and Amazon S3 remote connections are named gdrive-remote and s3-remote. You can run the following commands to list the objects in the specified path.
rclone ls <remote>:<folder_name>/<subfolder_name>
rclone ls <remote>:<bucket_name>/<folder_name>
Figure 10: Listing objects within specified paths from remotes using rclone list command
Start by copying files from one remote to another. The rclone copy command copies files from source to destination. For test purposes, consider transferring around 1 TB of data from Google Drive to an S3 bucket.
The direct command to copy is as follows:
rclone copy <source>:<sourcepath> <dest>:<destpath>
However, a plain copy does not account for the quotas that the storage service providers impose, and it is not tuned for transfer speed. Rclone flags add extra functionality to rclone commands, enabling you to manage data across remotes more efficiently. By choosing appropriate flag values, you can balance staying within the quotas against minimizing the time needed for data transfers.
The following command shows the flags used in this post. We first explain how these flags work and then use them to transfer data.
rclone copy \
  --tpslimit 200 \
  --transfers 200 \
  --buffer-size 200M \
  --checkers 400 \
  --s3-upload-cutoff 100M \
  --s3-chunk-size 100M \
  --s3-upload-concurrency 50 \
  gdrive-remote: \
  s3-remote:EXAMPLE-DESTINATION-BUCKET \
  -P
The -P/--progress flag shows real-time transfer statistics during file operations. The other flags work as follows:
--tpslimit limits the number of transactions per second (TPS). For an HTTP backend, a transaction or query is a request such as PUT, GET, or POST.
--transfers controls the number of simultaneous file transfers. By default, rclone performs four parallel transfers.
--buffer-size=SIZE specifies the buffer size for each transfer to improve transfer speed. Each transfer (see --transfers) uses the specified amount of memory for buffering.
--checkers=N controls the number of checkers that run in parallel during operations such as copying files. Checkers compare source and destination files to determine which files need to be transferred.
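Rclone supports Amazon S3 multipart upload. Multipart upload enables the uploading of a single object as multiple parts, each representing a contiguous portion of the object’s data. It’s recommended to use multipart uploads instead of single operations when the object size exceeds 100 MB. In the preceding command, --s3-upload-cutoff sets the object size above which rclone switches to multipart upload, --s3-chunk-size sets the size of each part, and --s3-upload-concurrency sets how many parts of the same file are uploaded in parallel. Refer to the multipart uploads with rclone documentation for more information.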
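Because these flags trade memory for speed, it can help to estimate an upper bound on memory use before picking an instance type. The sketch below multiplies the flag values from the copy command in this post: --transfers times --buffer-size for read buffers, and --transfers times --s3-upload-concurrency times --s3-chunk-size for multipart uploads, which is the formula given in the rclone S3 documentation. The function name is our own, and the results are worst-case bounds that are reached only if every transfer is a large multipart upload at the same time.

```shell
# estimate_memory_mib TRANSFERS BUFFER_MIB UPLOAD_CONCURRENCY CHUNK_MIB
# Prints two upper bounds in MiB: read buffers, then multipart upload buffers.
estimate_memory_mib() {
  echo "$(( $1 * $2 )) $(( $1 * $3 * $4 ))"
}

# Values from the copy command in this post (rclone's M suffix means MiB).
set -- $(estimate_memory_mib 200 200 50 100)
echo "read buffers:      up to $1 MiB (~$(( $1 / 1024 )) GiB)"
echo "multipart buffers: up to $2 MiB (~$(( $2 / 1024 )) GiB)"
```

If the bounds are far above the memory of your instance, lowering --transfers or --s3-upload-concurrency is one way to bring them down.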
Google Drive quotas
Because the Google Drive API is a shared service with multiple users, Google sets certain quotas, such as the number of transactions or queries allowed in a defined time period.
You can find more info in the documentation on usage limits from Google Drive.
To check your quotas, refer to the Google cloud view and manage quotas documentation, or you can follow these steps:
1. Visit the Google API Console.
2. Select your project. In the Enabled APIs & Services section, select Google Drive API.
3. Under QUOTAS & SYSTEM LIMITS, you can see properties such as “Queries per minute” and “Queries per minute per user.”
In our case, both “Queries per minute” and “Queries per minute per user” are set to 12,000, which translates to 200 queries per second (12,000 ÷ 60).
Amazon S3 quotas
On the other hand, Amazon S3 supports significantly higher request rates. It allows at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix, with no limit on the number of prefixes per bucket.
While Amazon S3 scales to a higher request rate, temporary 503 (Slow Down) errors may occur, but they subside once scaling completes. Refer to the best practices design patterns for optimizing Amazon S3 performance.
To make sure that you stay within the quota limits, consider the values from both remotes and set --tpslimit to the lower of the two. Because Google Drive allows only 200 queries per second in this case, set the flag as --tpslimit 200.
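The arithmetic behind this choice can be written out as a short script. It is a sketch using the quota values quoted in this post (12,000 queries per minute for Google Drive and 3,500 write requests per second per prefix for Amazon S3). The helper names are our own, and you should substitute the quotas shown for your own project.

```shell
# qpm_to_tps QUERIES_PER_MINUTE: convert a per-minute quota to per-second.
qpm_to_tps() {
  echo $(( $1 / 60 ))
}

# lower_of A B: print the smaller of two integers.
lower_of() {
  if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

drive_tps=$(qpm_to_tps 12000)   # Google Drive quota from the API console
s3_tps=3500                     # S3 write requests per second per prefix
tpslimit=$(lower_of "$drive_tps" "$s3_tps")
echo "--tpslimit $tpslimit"
```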
Figure 10.1: Rclone copy command with the flags to copy the data from Google Drive to the S3 bucket
Now you can implement the preceding commands. The following image shows that transferring 1001.06 GB of data from Google Drive to Amazon S3 took approximately 10 minutes and 15.7 seconds.
Figure 10.2: Output of the copy command showing the details about the transfer
check: Verifies that the files in the source and destination are identical.
rclone check gdrive-remote: s3-remote:EXAMPLE-DESTINATION-BUCKET -P
Figure 11. Rclone check command
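In a script, the exit status of rclone check can gate any follow-up step, such as deleting the source data. The wrapper below is a minimal sketch with a function name of our own. It adds the --one-way flag so that only files present in the source are checked against the destination, which is an assumption about what you want to verify after a copy.

```shell
# verify_copy SOURCE DEST
# Succeeds only if every source file has a matching file at the destination.
verify_copy() {
  if rclone check "$1" "$2" --one-way; then
    echo "OK: every file in $1 has a matching copy in $2"
  else
    echo "MISMATCH: review the rclone check output above" >&2
    return 1
  fi
}

# Example usage:
#   verify_copy gdrive-remote: s3-remote:EXAMPLE-DESTINATION-BUCKET
```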
delete: Deletes files from the given path.
rclone delete gdrive-remote:gdrive-test-folder-2 -P
Figure 12. Rclone delete command
Refer to the rclone commands documentation for a complete list of available commands, such as rclone sync if you’d like to sync the source with the destination.
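Unlike copy, rclone sync makes the destination match the source, which includes deleting destination files that are not in the source. A cautious pattern is to preview the changes with the --dry-run flag first. The sketch below uses a function name of our own and the remote names from this post.

```shell
# preview_sync SOURCE DEST
# Shows what sync would copy and delete without changing anything.
preview_sync() {
  rclone sync "$1" "$2" --dry-run
}

# Example usage:
#   preview_sync gdrive-remote: s3-remote:EXAMPLE-DESTINATION-BUCKET
```

Once the preview looks right, running the same command without --dry-run performs the sync.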
Monitoring EC2 instance
Refer to the rclone documentation to learn more about these flags and when to adjust their values. The different flags discussed in this post primarily use the network bandwidth and memory of the EC2 instance. We recommend experimenting with different flag values while monitoring your instance’s performance to achieve optimal results without exceeding the throttle limits of your EC2 instance.
For example, suppose you run the preceding command to test transferring 1 TB of data and notice that your EC2 instance’s CPU usage is 30%, memory usage is 60%, and network usage is 70%. Ideally, you want to stay under 100%, with a safer ceiling of around 90%. These metrics therefore indicate that there is room to increase the values of flags such as --transfers to speed up the transfer.
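The headroom reasoning in this example can be expressed as a small check. This is only an illustration of the rule of thumb above, with a helper name of our own and whole-number percentages that you read from your monitoring tool.

```shell
# has_headroom CEILING PCT...
# Succeeds if every utilization percentage is below the ceiling.
has_headroom() {
  ceiling=$1
  shift
  for pct in "$@"; do
    [ "$pct" -lt "$ceiling" ] || return 1
  done
}

# CPU 30%, memory 60%, network 70%, against a 90% ceiling.
if has_headroom 90 30 60 70; then
  echo "headroom available: consider raising flags such as --transfers"
else
  echo "at or above the ceiling: keep or lower the current flag values"
fi
```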
You can monitor your EC2 instance using Amazon CloudWatch, an application performance monitoring service.
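As one way to pull these numbers from the command line, the following sketch queries the average CPU utilization of an instance in 5-minute periods with the AWS CLI. The function name, instance ID, and timestamps are placeholders. Note that memory usage is not among the default Amazon EC2 metrics, so it needs additional setup before it can be queried this way.

```shell
# cpu_average INSTANCE_ID START_TIME END_TIME (ISO 8601 timestamps)
# Prints timestamp and average CPU utilization for each 5-minute period.
cpu_average() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Average \
    --query 'Datapoints[].[Timestamp,Average]' \
    --output text
}

# Example usage (placeholder values):
#   cpu_average i-0123456789abcdef0 2025-01-01T10:00:00Z 2025-01-01T11:00:00Z
```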
Other ways to optimize the data transfer process
The following methods can help you further improve the data transfer process:
1. Higher performance compute: A direct option is to select an EC2 instance type with higher performance. Try to have a balance among the following factors: Amazon EC2 usage time, instance cost, and data transfer completion time.
2. TPS quotas: To address TPS quotas, you can request a quota increase from Google.
3. Other relevant rclone flags: You can look into the following flags and evaluate if they are relevant to your preferences. Note that a few of these flags at times may cross the quota values such as TPS, so consider them with caution:
Cleaning up
You may want to delete the resources created in this post to avoid unwanted future charges. To delete the stack resources, you can delete the CloudFormation stack. In addition, visit the Google Cloud Console credentials page and delete the OAuth credentials that you created earlier.
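If you deployed from the command line, the stack can be deleted the same way. This is a minimal sketch with a function name of our own. It assumes the example stack name used earlier in this post. The destination S3 bucket was a prerequisite that you created yourself, so we would not expect deleting the stack to remove it or the data you transferred.

```shell
# delete_stack STACK_NAME
# Deletes the stack and waits until deletion completes.
delete_stack() {
  aws cloudformation delete-stack --stack-name "$1" &&
  aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Example usage:
#   delete_stack Rclone-GDrive-to-S3
```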
Conclusion
In this post, we explored how to efficiently transfer large amounts of data from Google Drive to Amazon S3 using Rclone. We automated the setup process of EC2 instance creation, Rclone installation, and remote connections configuration through AWS CloudFormation, reducing manual effort and potential errors. We also explored various Rclone flags that optimize transfer times while staying within service quotas, helping you avoid throttling and delays. This approach allows you to customize the data transfer process to match your specific requirements, ensuring efficient and reliable migrations even for large datasets.
By moving your data to Amazon S3, you can take advantage of cost savings, and its scalability and performance make it a great choice for your data lakes, analytics, and ML/AI applications. Additionally, Amazon S3 provides extensive security features and supports numerous compliance regulations, enhancing data protection and governance.
Feel free to check out other posts that might be helpful in migrating data and monitoring:
- Migrating Google Cloud Storage to Amazon S3 using AWS DataSync
- Setup memory metrics for Amazon EC2 instances using AWS Systems Manager
Thank you for reading this post. If you have any comments, feel free to post them in the comments section.