AWS Storage Blog

Cost efficiently migrate billions of small files with AWS Snowball

Migrating billions of small files to the cloud can be challenging because of the time and cost required to do so. Solutions that move small files to the cloud, whether online or offline, can significantly increase cost. This is primarily due to the size of each individual file being migrated: with small files, per-file overhead rather than bandwidth dominates the transfer. For clarity, a small file is anything smaller than 1 MB. For example, I performed a migration performance test where I transferred 1,000,000 small files, each less than 20 KB, from on-premises storage to Amazon S3. This test took 1.5 hours to complete, which works out to 3.6 MB per second. At this rate, transferring 1,000,000,000 files to the cloud would take approximately 1,500 hours, or roughly 62 days. Consequently, we would incur additional network charges as a result of the longer transfer time. This presents a challenge when you need to migrate datasets of small files quickly, and it can result in missed project deadlines.
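The projection above is simple linear scaling, and it can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every file is exactly 20 KB (the test files were smaller, so the throughput it prints is an upper bound) and that transfer time grows linearly with the number of files.

```python
# Figures taken from the test described above.
FILES_TESTED = 1000000
MAX_FILE_SIZE_BYTES = 20 * 1000   # upper bound: each file was under 20 KB
TEST_DURATION_HOURS = 1.5


def projected_hours(total_files):
    """Scale the measured duration linearly to a larger file count."""
    return total_files / FILES_TESTED * TEST_DURATION_HOURS


def max_throughput_mb_per_s():
    """Upper bound on throughput, assuming every file is exactly 20 KB."""
    total_mb = FILES_TESTED * MAX_FILE_SIZE_BYTES / 1000000
    return total_mb / (TEST_DURATION_HOURS * 3600)


hours = projected_hours(1000000000)
print("{:,.0f} hours, about {:.1f} days".format(hours, hours / 24))
print("at most {:.1f} MB/s".format(max_throughput_mb_per_s()))
```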

To accelerate data migration and reduce costs, customers often use AWS Snowball for bulk-data migrations to the cloud. The AWS Snow Family is well-suited when there are connectivity limitations, network bandwidth constraints, high network connection costs, legacy environmental challenges, or data collected in remote locations. Even with AWS Snowball, migrating billions of small files is slow unless the files are batched. To remove the manual overhead of batching small files into archives smaller than 100 GB, we developed Snowball Uploader, which automates the process. In this blog post, we show you how to use Snowball Uploader to accelerate copying billions of small files to Snowball Edge. The Snowball Uploader script is shared as-is and is not officially supported by AWS.

Overview

Snowball Uploader focuses on migrating billions of small files efficiently with AWS Snowball. Using this script, you can shorten your data migration time and thus reduce migration cost. Snowball Uploader, written in Python 3, works with the AWS SDK for Python (Boto3) and the AWS Command Line Interface (AWS CLI). The script first generates manifest files that identify what to copy to the Snowball device. It then batches the small files and archives them in TAR format. This batching is what reduces time and cost. To help you verify the transfer, Snowball Uploader also generates a success log and an error log that record the outcome for each file.
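To make the batching idea concrete, here is a minimal sketch that packs many small files into a single in-memory TAR archive using only the Python standard library. The function name `build_tar_batch` and its return values are illustrative assumptions and are not taken from the Snowball Uploader script.

```python
import io
import os
import tarfile


def build_tar_batch(file_pairs):
    """Pack (source_path, archive_name) pairs into one in-memory TAR archive.

    Returns the archive bytes plus the lists of packed and missing sources.
    """
    buffer = io.BytesIO()
    packed, missing = [], []
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for source_path, archive_name in file_pairs:
            if not os.path.isfile(source_path):
                # Listed in the batch but absent on disk.
                missing.append(source_path)
                continue
            tar.add(source_path, arcname=archive_name)
            packed.append(source_path)
    return buffer.getvalue(), packed, missing
```

One archive replaces thousands of individual object uploads, which is where the time savings come from. The `packed` and `missing` lists correspond loosely to what the success and error logs record.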


Figure 1: Snowball Uploader operation

Installing prerequisite packages

To get started, install the following items on a workstation that is connected to a Snowball device:

  • Snowball client: this command-line utility is used to unlock, set up, and administer AWS Snow Family devices. Follow the directions in the AWS Snowball Edge Data Migration Guide to set up an AWS profile. Then connect to your Snowball device and configure it for use with the Snowball Uploader script. Test copying data to the Snowball device before you run the Snowball Uploader script.
  • AWS CLI version 1.16.14+: this utility is used to communicate with the Amazon S3 Adapter endpoint on the Snowball device.
  • Python 3.5 (or higher): the Snowball Uploader script is written in Python, and Python is also a prerequisite for the AWS CLI.
  • Boto3: the AWS SDK for Python, which makes it easy to integrate your Python application, library, or script with AWS services including Amazon S3, Amazon EC2, and many more.
  • Snowball Uploader: the script that copies billions of small files to Snowball efficiently.
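A quick way to confirm that a workstation has these pieces in place is a small check like the one below. It is a sketch, not part of Snowball Uploader. It assumes the AWS CLI is installed as `aws` and the Snowball client as `snowballEdge`; adjust the executable names if your installation differs.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report which of the listed prerequisites are visible from this Python."""
    return {
        "python>=3.5": sys.version_info >= (3, 5),
        "boto3 importable": importlib.util.find_spec("boto3") is not None,
        # Executable names below are assumptions about a typical install.
        "aws CLI on PATH": shutil.which("aws") is not None,
        "snowballEdge client on PATH": shutil.which("snowballEdge") is not None,
    }


for name, ok in check_prerequisites().items():
    print("{:30s} {}".format(name, "ok" if ok else "missing"))
```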

Configuring Snowball Uploader script

Before running the snowball_uploader.py command, there are several parameters you’ll want to modify based on your needs. Let’s review each default parameter. You have the option to use the default settings shown below, but small modifications can improve your copy operations.

import os

import boto3

bucket_name = "your-own-bucket"
session = boto3.Session(profile_name='sbe1')  # AWS profile with your Snowball credentials
s3 = session.client('s3', endpoint_url='http://10.10.10.10:8080')  # S3 Adapter endpoint on the Snowball
# or, to test against Amazon S3 instead:
#s3 = session.client('s3', endpoint_url='https://s3.us-east-1.amazonaws.com')
target_path = '.'   ## very important!! change to your source directory
max_tarfile_size = 10 * 1024 ** 3 # 10GiB, 100GiB is max limit of snowball
max_part_size = 500 * 1000 ** 2 # 500MB, 500MiB is max limit of snowball
max_process = 5  # max number of threads; keep it below the number of manifest files in filelist_dir
if os.name == 'nt':
    filelist_dir = "C:/tmp/fl_logdir_dkfjpoiwqjefkdjf/"  # for Windows
else:
    filelist_dir = '/tmp/fl_logdir_dkfjpoiwqjefkdjf/'    # for Linux
  • bucket_name: the name of the bucket on your Snowball device.
  • session = boto3.Session(profile_name='sbe1'): the AWS profile name that supplies the credentials for your Snowball device. If you don’t have a Snowball device, you can test this script with Amazon S3 by pointing endpoint_url to one of Amazon S3’s public endpoints (e.g. endpoint_url='https://s3.[region].amazonaws.com').
  • target_path: the source directory path on your operating system. This path is also used as the root path on Snowball.
    • If target_path = 'move/to/s3/origin/', files are moved to s3://'bucket_name'/move/to/s3/origin/.
    • If target_path = '.', files are moved under s3://'bucket_name'/, so the directory path on S3 depends on the directory from which you run snowball_uploader.py.
    • We recommend you test the script with sample data before applying it to your data, and then verify that the path and file settings are correct after extracting the TAR files.
  • max_tarfile_size: the maximum size of each TAR file uploaded to Snowball.
    • Keep this value under 100 GB, which is the maximum size recommended for small file batching on Snowball.
    • snowball_uploader.py archives files into TAR files on the Snowball device. These TAR files can be automatically extracted in S3.
    • The metadata Metadata = {"snowball-auto-extract": "true"} is added to each TAR file.
  • max_part_size: the maximum part size for S3 multipart uploads. Snowball limits the part size to 512 MB.
    • The script uses the multipart upload feature of S3 to aggregate files into one large TAR file.
  • max_process: the maximum number of threads. Set it to a value lower than the number of manifest files in filelist_dir.
  • filelist_dir: the directory that holds the manifest files.
    • The /tmp/fl_logdir_dkfjpoiwqjefkdjf/ directory is fixed. It is removed and re-created whenever you run the script with genlist.
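The interaction between max_part_size, multipart upload, and the snowball-auto-extract metadata can be sketched as follows. This is an illustration under stated assumptions, not the script’s actual code. The function name `upload_tar_multipart` is hypothetical, and the `s3` argument is expected to behave like a Boto3 S3 client, such as the one created in the configuration above.

```python
import io


def upload_tar_multipart(s3, bucket, key, tar_bytes, part_size):
    """Upload tar_bytes as a single object using S3 multipart upload.

    `s3` must provide create_multipart_upload, upload_part, and
    complete_multipart_upload, as a Boto3 S3 client does.
    """
    upload = s3.create_multipart_upload(
        Bucket=bucket,
        Key=key,
        # Metadata named in this post; it marks the TAR for extraction in S3.
        Metadata={"snowball-auto-extract": "true"},
    )
    upload_id = upload["UploadId"]
    parts = []
    stream = io.BytesIO(tar_bytes)
    part_number = 1
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            break
        response = s3.upload_part(
            Bucket=bucket, Key=key, PartNumber=part_number,
            UploadId=upload_id, Body=chunk,
        )
        parts.append({"ETag": response["ETag"], "PartNumber": part_number})
        part_number += 1
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
    return len(parts)
```

When testing against Amazon S3 itself, note that every part except the last must be at least 5 MiB. A production version would also abort the multipart upload if any part fails, which this sketch omits.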

Some of these parameters may not be optimal for your environment, and the script can consume huge amounts of memory when the numeric parameters (max_process, max_part_size, and max_tarfile_size) are set too high, which can cause the system to lag. To avoid this, run the Snowball Uploader script several times with sample data, and tune these parameters before your production migration.
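Before tuning, it can help to see what the defaults imply. The sketch below computes how many multipart parts one full-size TAR file needs, and gives a rough memory estimate. How much data each worker actually holds in memory depends on the script’s internals, so `bytes_per_worker` is an explicit assumption. The two printed estimates are illustrative cases, not measurements.

```python
import math

# Defaults copied from the configuration shown earlier in this post.
max_tarfile_size = 10 * 1024 ** 3
max_part_size = 500 * 1000 ** 2
max_process = 5


def parts_per_tar(tar_size, part_size):
    """Number of multipart parts needed for one full-size TAR file."""
    return math.ceil(tar_size / part_size)


def buffered_bytes(workers, bytes_per_worker):
    """Rough memory estimate; bytes_per_worker is an ASSUMPTION about
    how much data each worker holds in memory at once."""
    return workers * bytes_per_worker


print("parts per TAR:", parts_per_tar(max_tarfile_size, max_part_size))
print("if each worker buffers one part: %.1f GB"
      % (buffered_bytes(max_process, max_part_size) / 1000 ** 3))
print("if each worker buffers one TAR:  %.1f GiB"
      % (buffered_bytes(max_process, max_tarfile_size) / 1024 ** 3))
```

With the defaults, a full TAR file is split into 22 parts. The gap between the two memory estimates shows why raising several of these values at once can exhaust memory quickly.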

Generating manifest files

To start, we gather information about the source files to be transferred to the Snowball device. We use this command to generate the manifest files.

ec2-user> python3 snowball_uploader.py genlist

Manifest files are generated as .txt files in the filelist_dir directory under /tmp, and each one contains a list of source files. In our example, five manifest files are generated, from fl_1.txt to fl_5.txt.

ec2-user> ls /tmp/fl_logdir_dkfjpoiwqjefkdjf/
fl_1.txt fl_2.txt fl_3.txt fl_4.txt fl_5.txt
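To illustrate what a manifest-generation step can look like, here is a minimal sketch that walks a source directory and spreads the file list across several manifest files. The round-robin split and the function name `generate_manifests` are assumptions made for illustration, and the real genlist option may distribute files differently. Each line holds a source path and a destination name, separated by a comma.

```python
import os


def generate_manifests(target_path, filelist_dir, manifest_count):
    """Walk target_path and write 'source, destination' lines into
    manifest_count files named fl_1.txt, fl_2.txt, and so on."""
    os.makedirs(filelist_dir, exist_ok=True)
    paths = [os.path.join(filelist_dir, "fl_%d.txt" % (i + 1))
             for i in range(manifest_count)]
    handles = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        index = 0
        for root, _dirs, files in os.walk(target_path):
            for name in sorted(files):
                source = os.path.join(root, name)
                # Destination defaults to the source path; rename later if needed.
                handles[index % manifest_count].write(
                    "%s, %s\n" % (source, source))
                index += 1
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Keep filelist_dir outside target_path in a sketch like this. Otherwise the manifest files themselves are picked up by the directory walk.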

Let’s look at one of the manifest files:

ec2-user> cat fl_1.txt
./snowball_uploader_11_failed.py, ./snowball_uploader_11_failed.py
./success_fl_2.20200226_002049.log, ./success_fl_2.20200226_002049.log
./file_list.txt, ./file_list.txt
./snowball-fl_1-20200218_151840.tar, ./snowball-fl_1-20200218_151840.tar
./bytesio_test.py, ./bytesio_test.py
./filelist_dir1_10000.txt, ./filelist_dir1_10000.txt
./snowball_uploader_14_success.py, ./snowball_uploader_14_success.py
./error_fl_1.txt_20200225_022018.log, ./error_fl_1.txt_20200225_022018.log
./snowball_uploader_debug_success.py, ./snowball_uploader_debug_success.py
./success_fl_1.txt_20200225_022018.log, ./success_fl_1.txt_20200225_022018.log
./snowball_uploader_20_thread.py, ./snowball_uploader_20_thread.py
./success_fl_1.yml_20200229_173222.log, ./success_fl_1.yml_20200229_173222.log
./snowball_uploader_14_ing.py, ./snowball_uploader_14_ing.py

You will notice that there are two file names listed per line, delimited by a comma. The left value is the original file name and the right value is the target, or destination, file name. You can change the file name manually at this stage, or you can use the rename_file() function to modify it automatically. Below is the sample code of rename_file() that helps you change the destination file name.

def rename_file(org_file):
    # Default behavior: keep the destination name identical to the source name.
    target_file = org_file
    return target_file
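As an example of customizing the destination name, the hypothetical variant below strips the leading "./" and places every file under a prefix. The `prefix` parameter and the `parse_manifest_line` helper are illustrative assumptions and are not part of the original script.

```python
def rename_file(org_file, prefix="archive/"):
    """Hypothetical variant: drop a leading './' and prepend a prefix."""
    target_file = org_file
    if target_file.startswith("./"):
        target_file = target_file[2:]
    return prefix + target_file


def parse_manifest_line(line):
    """Split a 'source, destination' manifest line on the first ', '."""
    source, destination = line.rstrip("\n").split(", ", 1)
    return source, destination


source, _ = parse_manifest_line("./bytesio_test.py, ./bytesio_test.py\n")
print(source, "->", rename_file(source))
```

Splitting on the first comma is ambiguous for file names that themselves contain ", ", so such names need separate handling.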

Running cp_snowball to migrate files to AWS Snowball

When you execute snowball_uploader.py with the cp_snowball parameter, it transfers the files to your Snowball device.


Figure 2: Demo of Snowball Uploader

When the script runs, pay attention to three things in its output: the logs, the metadata, and the source (dataset) list.

First, the Snowball Uploader script creates two log files per job. The success.log file contains the names of the files that were archived into the TAR file successfully. The error.log file contains the names of the files that are listed in the manifest file but do not exist in the file system. With these logs, you can check which data was transferred and which was not.
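The same distinction can be checked ahead of time. The sketch below reads one manifest file and separates the entries whose source file exists from those that are missing, which is roughly what ends up in the success and error logs. The function name is hypothetical, and the manifest line format is assumed to be "source, destination" as shown earlier.

```python
import os


def split_manifest_by_existence(manifest_path):
    """Return (present, missing) source paths listed in a manifest file."""
    present, missing = [], []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue
            # The source path is the part before the first ', '.
            source = line.split(", ", 1)[0]
            if os.path.isfile(source):
                present.append(source)
            else:
                missing.append(source)
    return present, missing
```

Running a check like this just before cp_snowball gives an early view of how many files were deleted or moved since genlist was run.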

Second, the metadata snowball-auto-extract=true is crucial if you want your TAR file contents extracted in Amazon S3 in the AWS Cloud. When you execute snowball_uploader.py, only TAR files are stored on the Snowball device. When the Snow service uploads your TAR files to Amazon S3, this metadata ensures that the TAR files are extracted as they are transferred from Snowball to S3. You can find the metadata of each TAR file in the console log or in success.log.

Third, the Snowball Uploader script runs using the static source file lists. When data migration initiatives take too long, files can be deleted, modified, or added. This sometimes creates confusion as it’s unclear which files were moved or retained. Having the original file list generated from genlist helps you to check what files are still on-premises or have moved to your S3 bucket. This is why the script consists of the two options: genlist and cp_snowball.
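A static file list makes this kind of reconciliation a simple set comparison. The sketch below compares the destination names from the manifests with the object keys found in the bucket. It assumes that a destination such as "./file.txt" maps to the key "file.txt", and the function names are hypothetical. The keys themselves could be collected with Boto3, for example through the list_objects_v2 paginator.

```python
def normalize_key(destination):
    """ASSUMPTION: a leading './' is not part of the resulting S3 key."""
    return destination[2:] if destination.startswith("./") else destination


def reconcile(manifest_destinations, bucket_keys):
    """Compare manifest destinations with the keys present in the bucket."""
    expected = {normalize_key(d) for d in manifest_destinations}
    uploaded = set(bucket_keys)
    return {
        "migrated": sorted(expected & uploaded),
        "still_on_premises": sorted(expected - uploaded),
        "unexpected_in_bucket": sorted(uploaded - expected),
    }


report = reconcile(["./a.txt", "./b.txt"], ["a.txt", "c.txt"])
print(report)
```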

Uploading individual files versus uploading with Snowball Uploader

Let’s compare the performance of uploading files individually with uploading batched files using the Snowball Uploader script. The first test shows the results from individual file uploads using the AWS CLI. The second test shows the results when we used the Snowball Uploader script.

| Target | Tool | No. of files | Total data size | NAS -> Snowball time | Snowball -> S3 time | Failed objects |
|---|---|---|---|---|---|---|
| Snowball performance – files uploaded individually | AWS CLI | 19,567,430 | 2,408 GB | 1 week | 113 hours | 954 |
| Snowball performance – files uploaded with Snowball Uploader | Snowball Uploader | 119,577,235 | 14,708 GB | 1 week | 26 hours | 0 |

Figure 3: Performance comparison of AWS CLI and Snowball Uploader

Based on the table above, we observed seven times better performance using Snowball Uploader compared with the AWS CLI. Note the Total data size, Snowball -> S3 time, and Failed objects columns. However, your performance will vary depending on your environment, your file size, the number of files, and other variables.
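For readers who want to see how the individual columns compare, the ratios can be computed directly from the table. This is a rough reading of the published figures only, not a re-measurement. The ratios differ by column, so any single headline multiplier depends on which columns are compared.

```python
# Figures copied from the table above.
cli = {"files": 19567430, "size_gb": 2408, "import_hours": 113}
uploader = {"files": 119577235, "size_gb": 14708, "import_hours": 26}

# Both copies from NAS to Snowball took about one week, so these ratios
# describe how much more was copied in the same time window.
file_ratio = uploader["files"] / cli["files"]
data_ratio = uploader["size_gb"] / cli["size_gb"]

# Snowball -> S3 stage: data imported per hour.
cli_rate = cli["size_gb"] / cli["import_hours"]
uploader_rate = uploader["size_gb"] / uploader["import_hours"]
import_rate_ratio = uploader_rate / cli_rate

print("files copied per week:  %.1fx" % file_ratio)
print("data copied per week:   %.1fx" % data_ratio)
print("S3 import rate (GB/h):  %.1fx" % import_rate_ratio)
```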

Summary

In this blog post, we demonstrated how you can use the Snowball Uploader script with AWS Snowball to accelerate data migrations that consist of many small files. With Snowball Uploader, you can shorten your data migration time and thus reduce migration cost. As a result, Snowball Uploader helps you meet your project deadlines. For example, AWS helped Lotte transfer 140 million files in two weeks using Snowball to power LotteON, which is an online shopping mall in Korea.

Use the Snowball Uploader script to migrate your small files to AWS today, and make sure your data is safe by testing the script several times before using it for your production data migration. To learn more about data migrations with Snowball or other Snow devices, check out the following resources:

Thank you for reading this blog. If you have ideas on how AWS can improve this script, please join us by filing a GitHub pull request, or leave a comment in the comments section.