AWS Storage Blog
Capture and transfer petabytes from disconnected environments with AWS Snowball
Do you need to capture terabytes or petabytes of data from the field, but struggle to offload the data efficiently, keep up with storage capacity, and maintain equipment in extreme environmental conditions? We commonly hear these challenges from customers who are capturing data in remote locations for future analysis. It is time-consuming to offload large amounts of data from source systems due to the number of storage devices required and limited data transfer rates. To complicate matters, equipment may have high failure rates in environments where maintenance is inconvenient. Even after the data is copied from the sensors, it often takes weeks to transport the data to the data center, load it onto the storage system, and prepare it for analysis. Finally, operating and maintaining petabytes of storage both in the field and in a data center is labor-intensive and expensive.
AWS Snowball Edge can solve these challenges. Snowball Edge devices are ruggedized so they can operate in harsh, disconnected environments. They have high-speed ports that facilitate transfer speeds in excess of 1 gigabyte per second. Customers can use Snowball Edge devices to maintain a copy of data locally and transfer a copy of the data directly to AWS. AWS promptly copies the data to your account so you can begin exploring it as quickly as possible. You no longer need to manage the logistics of transporting storage from the field to the data center floor or operate and maintain large storage systems.
In this blog, we describe how to quickly and durably transfer files using AWS Snowball Edge and Amazon S3 compatible storage on Snow in accordance with best practices for accelerating data migrations. Amazon S3 compatible storage on Snow Family devices delivers secure object storage with increased resiliency, scale, and an expanded Amazon S3 API feature set. This industry-agnostic solution facilitates horizontal scaling with parallelized data transfers to migrate data to AWS as fast as your local network constraints will allow. The solution is optimized for larger files, so if your data consists of files smaller than 1 MB, review the Snow Transfer Tool instead.
Solution overview
We’ll walk through the steps to collect data from a single source, copy the data to a landing zone for durable storage provided by Amazon S3 compatible storage on Snow, and copy the data to an AWS Snowball Edge device for import into Amazon S3. The AWS Snowball Edge import device will then be returned to AWS where the data is transferred to your Amazon S3 buckets.
Offload agent
This solution begins with the offload agent, a Python script that synchronizes the source data file system with each destination group using s5cmd. s5cmd is an open-source community tool that provides fast data transfers to Amazon S3 endpoints with parallelism, multi-threading, tab completion, and wildcard support for files. The recommended design is to use Amazon S3 compatible storage on Snowball Edge devices in a cluster configuration to act as a durable landing zone. Amazon S3 compatible storage on Snow provides familiar Amazon S3 APIs to easily develop, deploy, and manage applications requiring object storage. Customers benefit from built-in resiliency, security, performance, and the most frequently used Amazon S3 features available in AWS Regions, in rugged, mobile edge, and disconnected environments.
The offload agent introduces the concept of a destination group, which can be one or more Snowball Edge devices using the Amazon S3 adapter on AWS Snowball or Amazon S3 compatible storage on Snow. This gives you the flexibility to choose the right mix of transfer performance, capacity, and cost for your application. For this use case, Amazon S3 compatible storage is uniquely suited for the landing zone due to its durability.
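To make the mechanics concrete, the following is a minimal sketch of the kind of s5cmd invocation the agent issues for each device in a destination group. The endpoint, bucket, profile, source path, and worker count are example values taken from the configuration later in this post; the exact flags the agent uses may differ.
# Sketch only: one sync per destination device, launched in parallel by the agent.
# Add --no-verify-ssl if you have not imported the device certificate into your trust store.
s5cmd --profile cluster-0 --endpoint-url https://192.168.88.41 --numworkers 20 \
  sync /data/work2/ s3://dataoffload/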
Scaling and performance
You can scale the solution by adding standalone AWS Snowball Edge devices or larger clusters to a destination group. In our testing, with Snowball Edge Compute Optimized devices with AMD EPYC Gen2 and NVMe SSD drives as the Data Transfer Snowball, we transferred data at rates of 1.8 gigabytes per second to a 4-node S3 compatible storage cluster. This test used a single instance of the Data Transfer agent writing to each of the cluster nodes concurrently.
Security
Cloud security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture that is built to meet the requirements of the most security-sensitive organizations. Security is a shared responsibility between AWS and you. We strongly encourage customers to review and follow AWS Snowball Edge security documentation and best practices.
Deploying the solution
There are six steps to deploy the solution:
- Use OpsHub for AWS Snow Family to create an Amazon Linux 2 EC2-Compatible instance on the data transfer Snowball.
- Install and set up dependencies on the EC2-compatible instance, including AWS CLI v2, the Snowball Edge client, and Go.
- Attach a Direct Network Interface (DNI) to your data transfer EC2 instance.
- Install the data offload agent and associated dependencies (Go and s5cmd).
- Edit your configuration settings (config.json in the repository).
- Launch your job!
Prerequisites
To deploy this solution, you need:
- An AWS account
- Snowball Edge devices
- Network switch to facilitate connectivity
Refer to the following table for the Snowball Edge devices required to deploy the solution:
Refer to the Getting Started section in the AWS Snowball Edge Developer Guide for more information about ordering Snowballs.
In addition to the Snowball Edge devices, you will also need a switch to facilitate networking between the source data devices and the Snowball Edge devices. To maximize network throughput, make sure your switch can provide a 100 Gbps QSFP port for the local compute Snowball. We also recommend at least 25 Gbps SFP28 ports for landing zone and data import zone Snowball Edge devices. See the AWS Snowball Edge device specifications for additional networking specification details.
Deployment Steps
Once you have satisfied the prerequisites, you are ready to deploy the solution. Follow these steps to set up the Data Transfer Snowball:
Step 1: Use OpsHub for AWS Snow Family to create an Amazon Linux 2 EC2-compatible instance on the Snowball
Create the EC2 instance that will run the copy job. Open OpsHub, unlock the Snowball, and launch an Amazon EC2-compatible instance. Be sure to select your Amazon Linux 2 AMI and QSFP Network Interface. You’ll need the key pair to Secure Shell (SSH) into this instance.
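If you prefer the command line to OpsHub, you can also launch the instance with the AWS CLI against the Snowball's EC2-compatible endpoint. This is a sketch using values that appear later in this post (AMI ID, instance type, key pair name, device IP, and profile); substitute your own. Note that with the CLI you still need to associate a network interface for connectivity, which OpsHub otherwise handles for you.
# Sketch: launch an Amazon Linux 2 EC2-compatible instance from the CLI instead of OpsHub
aws ec2 run-instances \
  --image-id s.ami-0ecfe5422d1114aea \
  --instance-type sbe-c.4xlarge \
  --key-name dataimport \
  --endpoint http://192.168.88.91:8008 \
  --region snow \
  --profile datatransfer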
Step 2: Install and set up dependencies on the EC2-compatible instance
Secure Shell (SSH) into the newly created instance and run the following commands to install AWS CLI v2, the Snowball Edge client, and Go:
sudo yum update -y
sudo yum install go -y
go install github.com/peak/s5cmd/v2@master
sudo yum remove awscli -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
wget https://snowball-client.s3.us-west-2.amazonaws.com/latest/snowball-client-linux.tar.gz
tar -xzvf snowball-client-linux.tar.gz
# Edit ~/.bash_profile to add s5cmd and the snowballEdge client to your PATH, for example:
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/snowball-client-linux-1.2.0-664/bin:$HOME/go/bin
export PATH
export AWS_REGION='snow'
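After updating ~/.bash_profile, a quick check (an optional step we suggest, not part of the original setup) confirms each tool resolves on your PATH:
# Reload the profile and verify the tools are available
source ~/.bash_profile
aws --version
s5cmd version
snowballEdge version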
Step 3: Attach a Direct Network Interface to your data transfer EC2-compatible instance
In this solution, the EC2-compatible instance that hosts the data transfer agent leverages a direct network interface (DNI) to achieve the network throughput required to move data between the source data systems and the destination groups. We strongly recommend attaching this DNI to the QSFP28 port to eliminate the chance of the network interface becoming a bottleneck. Follow the steps below to add the DNI to your EC2-compatible instance:
#Get the interface id
snowballEdge describe-device --profile datatransfer
"PhysicalNetworkInterfaces" : [ {
"PhysicalNetworkInterfaceId" : "s.ni-876b27c7035d7aecb",
"PhysicalConnectorType" : "QSFP",
"IpAddressAssignment" : "STATIC",
"IpAddress" : "192.168.88.91",
"Netmask" : "255.255.255.0",
"DefaultGateway" : "192.168.88.1",
"MacAddress" : "38:68:dd:06:6f:36"
} ]
# Get the EC2 instance ID
aws ec2 describe-instances --profile datatransfer --endpoint http://192.168.88.91:8008 --region snow
{
"Reservations": [
{
"Instances": [
{
"AmiLaunchIndex": 0,
"ImageId": "s.ami-0ecfe5422d1114aea",
"InstanceId": "s.i-831b31648e8f5e1a8",
"InstanceType": "sbe-c.4xlarge",
"KeyName": "dataimport",
"LaunchTime": "2023-08-02T18:25:57.673000+00:00",
"PrivateIpAddress": "34.223.14.193",
"PublicIpAddress": "192.168.88.62"
}
],
"ReservationId": "s.r-87cc2bcfb1e7bb57a"
}
]
}
# Create and attach the DNI to the EC2 instance
snowballEdge create-direct-network-interface \
--endpoint https://[datatransfer_snowball_ip] \
--instance-id [ec2_instance_id] \
--manifest-file [path_to_manifest_file] \
--physical-network-interface-id s.ni-876b27c7035d7aecb \
--profile datatransfer \
--unlock-code [unlockcode]
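To confirm the DNI was created and attached to your instance (an optional check), you can list the direct network interfaces with the Snowball Edge client:
# Optional check: the output should show the new DNI associated with your instance ID
snowballEdge describe-direct-network-interfaces --profile datatransfer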
After you create a DNI and associate it with your EC2-compatible instance, you must make two configuration changes inside your Amazon EC2-compatible instance.
- The first change ensures that packets meant for the virtual network interface (VNI) associated with the EC2-compatible instance are sent through eth0.
- The second change configures your direct network interface to use either DHCP or a static IP address when booting.
The Snowball documentation has examples of shell scripts for Amazon Linux 2 and CentOS Linux that make these configuration changes. That documentation assumes you are using DHCP for the DNI. We recommend setting a static IP address by editing or adding the following values in /etc/sysconfig/network-scripts/ifcfg-eth1:
# Edit /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.88.60
NETMASK=255.255.255.0
# Restart network
systemctl restart network
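As an optional check (not part of the original walkthrough), confirm eth1 came up with the static address and can reach the local network before moving on:
# Verify the DNI address and basic connectivity to the gateway shown earlier
ip addr show eth1
ping -c 3 192.168.88.1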
Step 4: Install the Data Offload Agent
Clone the GitHub repository to install the Data Offload Agent.
git clone https://github.com/aws-samples/data-offload
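The configuration file in the next step references named AWS CLI profiles (for example, cluster-0 through cluster-3 and DataImport). If you have not already created them, here is a hedged sketch of retrieving each device's local credentials with the Snowball Edge client and storing them in a matching AWS CLI profile; the bracketed values are placeholders, and the AWS CLI profile names must match the ones you put in config.json.
# Sketch: create one AWS CLI profile per destination Snowball
# List the device's access key, then retrieve the matching secret key
snowballEdge list-access-keys --profile [snowball_profile]
snowballEdge get-secret-access-key --access-key-id [access_key_id] --profile [snowball_profile]
# Store the key pair in an AWS CLI profile whose name matches config.json (for example, cluster-0)
aws configure --profile cluster-0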
Step 5: Edit your configuration settings (config.json in the repository)
Refer to the README in the GitHub repository for detailed descriptions of each variable. We encourage you to test which settings work best for your file sizes. We found ~20 workers gave the best transfer performance for file sizes over 1 gigabyte.
{
"log_level": "info",
"num_workers" : "20",
"reporting_frequency" : 5,
"source": "/data/work2",
"destinations" : {
"group1" : {
"type" : "s3compatible",
"snowballs" : [
{
"bucket": "dataoffload",
"endpoint": "https://192.168.88.41",
"profile": "cluster-0",
"name": "cluster-0"
},
{
"bucket": "dataoffload",
"endpoint": "https://192.168.88.43",
"profile": "cluster-1",
"name": "cluster-1"
},
{
"bucket": "dataoffload",
"endpoint": "https://192.168.88.45",
"profile": "cluster-2",
"name": "cluster-2"
},
{
"bucket": "dataoffload",
"endpoint": "https://192.168.88.47",
"profile": "cluster-3",
"name": "cluster-3"
}
]
},
"group2": {
"type": "s3adapter",
"snowballs": [
{
"bucket": "snowtesttwilhoi",
"endpoint": "https://192.168.88.90:8443/",
"profile": "DataImport",
"name": "DataImport"
}
]
}
}
}
Step 6: Launch your job!
Run this command to launch the job:
python3 main.py --config_file config.json
The agent will run until it has replicated all files from the source directory (typically an NFS or SMB mount) to each of the destination groups. If you run the script again, it only replicates files that were added or modified in the source directory.
You can also run the script multiple times with different configuration files if you have multiple sources to copy.
When you run the agent, it will read the configuration file, determine which files should be copied, and report on status. Here is a sample output from the agent:
Initial agent launch output
(snowballdataoffload) [ec2-user@ip-34-223-14-194 snowballdataoffload]$ python3 main.py --config_file config.json
09/21/2023 21:17:50 INFO: ############################
09/21/2023 21:17:50 INFO: Read config successful
09/21/2023 21:17:50 INFO: Comparing files in directory /data/work2 to files on destination group1
09/21/2023 21:17:50 INFO: Found 60 pending files and 0 completed files
09/21/2023 21:17:50 INFO: Launched offload process for snowball at https://192.168.88.41
09/21/2023 21:17:50 INFO: Launched offload process for snowball at https://192.168.88.43
09/21/2023 21:17:50 INFO: Launched offload process for snowball at https://192.168.88.45
09/21/2023 21:17:50 INFO: Launched offload process for snowball at https://192.168.88.47
09/21/2023 21:17:50 INFO: Finished launching offload processes for group named group1
09/21/2023 21:17:50 INFO:
09/21/2023 21:17:51 INFO: Comparing files in directory /data/work2 to files on destination group2
09/21/2023 21:17:51 INFO: Found 60 pending files and 0 completed files
09/21/2023 21:17:51 INFO: Launched offload process for snowball at https://192.168.88.90:8443/
09/21/2023 21:17:51 INFO: Finished launching offload processes for group named group2
09/21/2023 21:17:51 INFO:
09/21/2023 21:17:50 INFO: Waiting for data offload processes to finish
Status Reporting to STDOUT
Current Status: [==================================================] 100% group 1
09/21/2023 21:24:20 INFO: [group1] 209.7 GB of 209.7GB copied. 0.0 GB remaining
Current Status: [=====================                             ] 100% group 2
09/21/2023 21:24:20 INFO: [group2] 83.9 GB of 209.7GB copied. 125.8 GB remaining
Logging
In addition to reporting status in the console, the script also logs progress to the logs directory. Log files include a timestamp in the file name for ease of navigation. If you wish to run the agent in the background with screen or nohup, you can tail this log to monitor progress of a data offload job. Here is a sample log file:
(snowballdataoffload) [ec2-user@ip-34-223-14-194 logs]$ more snowball_data_offload_21_33_21_09_2023.log
09/21/2023 21:33:43 INFO: ############################
09/21/2023 21:33:43 INFO: Read config successful
09/21/2023 21:33:43 INFO: Comparing files in directory /data/work2 to files on destination group1
09/21/2023 21:33:43 INFO: Found 60 pending files and 0 completed files
09/21/2023 21:33:43 INFO: Launched offload process for snowball at https://192.168.88.41
09/21/2023 21:33:43 INFO: Launched offload process for snowball at https://192.168.88.43
09/21/2023 21:33:43 INFO: Launched offload process for snowball at https://192.168.88.45
09/21/2023 21:33:43 INFO: Launched offload process for snowball at https://192.168.88.47
09/21/2023 21:33:43 INFO: Finished launching offload processes for group named group1
09/21/2023 21:33:43 INFO:
09/21/2023 21:33:43 INFO: Comparing files in directory /data/work2 to files on destination group2
09/21/2023 21:33:43 INFO: Found 60 pending files and 0 completed files
09/21/2023 21:33:43 INFO: Launched offload process for snowball at https://192.168.88.90:8443/
09/21/2023 21:33:43 INFO: Finished launching offload processes for group named group2
09/21/2023 21:33:43 INFO:
09/21/2023 21:33:43 INFO: Waiting for data offload processes to finish
09/21/2023 21:33:43 INFO: [group1] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:33:43 INFO: [group2] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:34:15 INFO: [group1] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:34:15 INFO: [group2] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:34:45 INFO: [group1] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:34:45 INFO: [group2] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:35:16 INFO: [group1] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:35:16 INFO: [group2] 0.0GB of 209.7GB copied. 209.7 GB remaining
09/21/2023 21:35:46 INFO: [group1] 209.7GB of 209.7GB copied. 0.0 GB remaining
09/21/2023 21:35:46 INFO: [group2] 4.20GB of 209.7GB copied. 205.5 GB remaining
09/21/2023 21:36:21 INFO: [group1] 209.7GB of 209.7GB copied. 0.0 GB remaining
09/21/2023 21:36:21 INFO: [group2] 125.8GB of 209.7GB copied. 83.9 GB remaining
09/21/2023 21:36:53 INFO: [group1] 209.7GB of 209.7GB copied. 0.0 GB remaining
09/21/2023 21:36:53 INFO: [group2] 125.8GB of 209.7GB copied. 83.9 GB remaining
09/21/2023 21:37:23 INFO: [group1] 209.7GB of 209.7GB copied. 0.0 GB remaining
09/21/2023 21:37:23 INFO: [group2] 172.0GB of 209.7GB copied. 37.7 GB remaining
09/21/2023 21:36:21 INFO: All data offload processes are finished
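As noted above, you can also run the agent in the background with nohup or screen and follow the log. Here is a minimal sketch with nohup; the log file name comes from the example above, and yours will carry its own timestamp.
# Run the agent in the background and follow the log file
nohup python3 main.py --config_file config.json > offload.out 2>&1 &
tail -f logs/snowball_data_offload_21_33_21_09_2023.log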
Conclusion
This blog reviewed the steps to deploy a data offload solution for disconnected environments, such as national defense, natural disaster response, or other use cases at the edge. First, we walked through deploying the solution by setting up the data transfer EC2-compatible instance, installing prerequisites, and configuring the DNI. Then, we walked through launching the offload agent to kick off a data transfer job.
This solution allows you to offload data in a disconnected environment using Snowball Edge devices while providing the security, durability, and horizontal scaling required to meet the most stringent requirements. Once implemented, you can transfer your data to the cloud and begin working with it as quickly as possible, reducing your time to insight.
If you need an account to get started, review the first-time user page. Read the Getting Started section of the AWS Snowball Edge Developer Guide for more details about how to order your first AWS Snowball Edge.