Networking & Content Delivery
Networking best practices for generative AI on AWS
Introduction
As generative artificial intelligence (generative AI) continues to evolve, the demand for more powerful and efficient computing resources grows, along with the need to manage exponentially increasing amounts of data. Datasets used for training generative AI models are typically measured in terabytes (TB), orders of magnitude larger than traditional machine learning (ML) datasets, whose size is typically measured in gigabytes (GB). This is directly related to the billions of parameters (neural network weights) used in generative AI models. Academic research has shown a correlation between model size and the amount of training data required for optimal performance.
In this context, networking plays a critical role in all phases of the generative AI workflow: data collection, training, and deployment. In this post, we share some networking recommendations and best practices for training and fine-tuning generative AI models on Amazon Web Services (AWS).
Reference architecture
The following diagram shows a sample architecture that can serve as a reference. We detail the components in the rest of this post. Variations of this architecture exist, and it is beyond the scope of this post to show every possible combination. For example, AWS DataSync can write directly to Amazon FSx for Lustre, and Amazon Elastic Compute Cloud (Amazon EC2) instances can read training data from Amazon Simple Storage Service (Amazon S3) through gateway VPC endpoints.
Data collection
A first step in training generative AI models is to move the data to the AWS Region where the training will occur. Avoid the pitfall of accessing on-premises data sources such as shared file systems and Hadoop clusters directly from compute nodes in AWS. Protocols like Network File System (NFS) and Hadoop Distributed File System (HDFS) are not designed to work over wide-area networks (WANs), and throughput will be low. Instead, first copy the data to AWS using the specialized services we discuss in this section. Amazon S3 is a commonly used service to store training data due to its low cost, high performance, and durability.
Online data copy
If sufficient bandwidth is available, data can be transferred over the network. An AWS DataSync agent is deployed on-premises as a virtual machine (VM) and can read from multiple sources, including NFS and Server Message Block (SMB) file shares, Hadoop clusters, and third-party clouds. Targets for data replication include Amazon S3 and Amazon FSx for Lustre. AWS DataSync provides in-flight encryption and end-to-end data validation. AWS DataSync can also be used for data transfers between different services and AWS Regions, for example, to copy from an Amazon Elastic File System (Amazon EFS) file system to an S3 bucket in a different Region. To copy data from an S3 bucket to a bucket in a different Region, S3 Cross-Region Replication (CRR) is the preferred method.
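As a minimal sketch of this flow with the AWS SDK for Python (Boto3) — where the hostname, ARNs, and bucket name are placeholders, and the agent is assumed to be already activated — a transfer from an on-premises NFS share to Amazon S3 might look like this:

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Source: an on-premises NFS export, read through an already-activated agent.
nfs = datasync.create_location_nfs(
    ServerHostname="nfs.corp.example.com",           # placeholder
    Subdirectory="/exports/training-data",
    OnPremConfig={"AgentArns": [
        "arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"
    ]},
)

# Destination: an S3 bucket, accessed through an IAM role DataSync can assume.
s3 = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-training-data",     # placeholder
    Subdirectory="/raw",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# The task options enable DataSync's end-to-end data validation.
task = datasync.create_task(
    SourceLocationArn=nfs["LocationArn"],
    DestinationLocationArn=s3["LocationArn"],
    Name="nfs-to-s3-training-data",
    Options={"VerifyMode": "POINT_IN_TIME_CONSISTENT"},
)

datasync.start_task_execution(TaskArn=task["TaskArn"])
```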
At the network layer, AWS DataSync supports connectivity through the public internet, AWS Site-to-Site VPN, and AWS Direct Connect. Using the public internet only requires an internet connection at the on-premises data center, but throughput is subject to variable internet conditions ("internet weather"). Site-to-Site VPN connections provide a fast and convenient method for connecting to AWS but are limited to 1.25 gigabits per second (Gbps) per tunnel. AWS Direct Connect delivers fast and secure data transfer for sensitive training datasets. It bypasses the public internet and creates a private, physical connection between your network and the AWS global network backbone. AWS Direct Connect is available at speeds up to 100 Gbps in over 100 locations worldwide, and 400 Gbps ports are available at selected locations for customers requiring the highest performance.
AWS DataSync supports both public service endpoints, including Federal Information Processing Standard (FIPS) endpoints, and VPC service endpoints. AWS Direct Connect public virtual interfaces (VIFs) provide a cost-effective way to connect to public and FIPS endpoints. If you choose VPC service endpoints, you can use a private or transit VIF. You will be billed by AWS PrivateLink for the interface VPC endpoints you create to manage and control traffic between your agent(s) and the DataSync service. Note that only control traffic is subject to PrivateLink charges; the files or objects transferred by DataSync are not. Refer to the post Improving Performance on AWS and Hybrid Networks for an in-depth analysis of the factors that impact network throughput.
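For the VPC service endpoint option, the interface endpoint itself is a single API call; in this sketch all IDs are placeholders, and we assume private DNS stays disabled because the agent addresses the endpoint by its private IP address:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoint for the DataSync control plane (IDs are placeholders).
resp = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.datasync",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=False,  # assumption: agent uses the endpoint's private IP
)
print(resp["VpcEndpoint"]["VpcEndpointId"])
```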
Offline data copy
For larger datasets or sites with connectivity challenges, the AWS Snow Family provides offline data copy functionality. For example, moving 1 petabyte (PB) of data over a 1 Gbps link can take around 4 months, but the same transfer can be completed in a matter of days using five AWS Snowball Edge Storage Optimized devices. AWS Snow Family devices are shipped to your facilities, where you connect them to your network and copy the data. They are then returned to AWS using an E Ink shipping label for easy tracking, and the data is uploaded to Amazon S3. All data moved to AWS Snow Family devices is automatically encrypted with 256-bit encryption keys managed by AWS Key Management Service (AWS KMS). Encryption keys are never stored on the device, so your data stays secure during transit. You can use the following table as a reference to choose between online and offline (AWS Snow Family) data transfer.
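The 4-month figure follows from simple back-of-the-envelope arithmetic; the utilization factor below is an assumption about real-world effective throughput:

```python
# Time to move 1 PB over a dedicated 1 Gbps link.
dataset_bytes = 1e15          # 1 PB (decimal)
link_bps = 1e9                # 1 Gbps line rate
utilization = 0.8             # assumed effective throughput (protocol overhead, sharing)

seconds = dataset_bytes * 8 / (link_bps * utilization)
print(f"{seconds / 86400:.0f} days")  # ~116 days, roughly 4 months
```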
Offline and online combined
Some use cases may require a combined approach. Initial training data can be transferred offline, after which incremental updates are sent on a continuous or regular basis. This incremental data could be used for model retraining, fine-tuning, or Retrieval Augmented Generation (RAG). For example, after an initial transfer using the AWS Snow Family, AWS Database Migration Service (AWS DMS) ongoing replication can be used to capture changes in a database and send them to Amazon S3.
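As a hedged sketch of the ongoing-replication step (all ARNs are placeholders, and the source and S3 target endpoints plus the replication instance are assumed to already exist), the change-data-capture task can be created with Boto3:

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Capture changes from every table; DMS table mappings are a JSON string.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="incremental-updates-to-s3",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:S3TARGET",  # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",  # placeholder
    MigrationType="cdc",  # changes only; the initial bulk load was done offline
    TableMappings=json.dumps(table_mappings),
)
```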
Training
For the training phase, we can distinguish between two use cases: accessing the training data and exchanging data among training nodes.
Accessing training data
Training data stored in Amazon S3 can be accessed by linking it to Amazon FSx for Lustre. The low-latency and high-throughput characteristics of FSx for Lustre are optimized for deep learning, generative AI, and high performance computing (HPC) workloads. If you want to access data directly from Amazon S3, the most scalable option is to use Amazon VPC gateway endpoints. Gateway endpoints provide reliable connectivity to Amazon S3 and Amazon DynamoDB without requiring an internet gateway or a NAT device for your VPC. To further improve Amazon S3 access time, you can use the Amazon S3 Express One Zone storage class. Amazon S3 Express One Zone is a high-performance, single-zone storage class purpose-built to deliver consistent, single-digit millisecond data access for your most latency-sensitive applications. S3 Express One Zone is the lowest-latency cloud object storage class available today, with data access speeds up to 10x faster and request costs 50 percent lower than S3 Standard.
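Both options take only a few API calls. The sketch below (file system, VPC, bucket, and route table identifiers are placeholders) links a directory of an existing FSx for Lustre file system to an S3 prefix and creates an S3 gateway endpoint:

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Link /training-data on the file system to an S3 prefix; new, changed, and
# deleted objects in S3 are reflected in the file system automatically.
fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",                 # placeholder
    FileSystemPath="/training-data",
    DataRepositoryPath="s3://my-training-data/raw",      # placeholder
    S3={"AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]}},
)

# Gateway endpoint so instances reach S3 without an internet gateway or NAT.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                       # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```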
Other AWS services and third-party SaaS products can be accessed using AWS PrivateLink. Use AWS PrivateLink to allow the resources in your VPC to connect to services in other VPCs using private IP addresses, as if those services were hosted directly in your VPC. By default, each VPC endpoint can support a bandwidth of up to 10 Gbps per Availability Zone and automatically scales up to 100 Gbps. If additional bandwidth is required, you can scale traffic across multiple interface endpoints. If you control both the source and destination VPCs, VPC peering provides a connection without bandwidth bottlenecks, at no extra cost within an Availability Zone (charges apply for data transfer over VPC peering connections that cross Availability Zones or Regions).
Data exchange among training nodes
There are multiple techniques to improve the performance of information exchange among training nodes. In this section, we will explain three: flattening the network topology, bypassing the operating system, and enabling parallelism in data flows.
Network topology
Networks are built in layers (Figure 3). This reduces complexity while allowing horizontal scalability. Most networks also implement oversubscription, meaning that the bandwidth toward the layer above is less than the aggregate bandwidth toward the layer below. Using Figure 3 as an example, this means that the bandwidth from network node 2 to network node 1 is less than the combined bandwidth from node 2 to nodes 4 and 5. It also means that communication between A and B has more bandwidth available than communication between A and C, in addition to lower latency because fewer network hops are involved.
Oversubscription works well for most applications with variable bandwidth requirements, providing a cost-effective way to accommodate the aggregated traffic requirements and occasional spikes under the assumption that not all nodes need to transmit data at the same time. Distributed training algorithms break this assumption when the training nodes exchange training data and when they reconcile the training results to update the model weights.
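To make the effect concrete, here is the arithmetic with illustrative (assumed) link speeds for network node 2 in Figure 3:

```python
# Illustrative oversubscription at network node 2 (link speeds are assumptions).
downlink_gbps = 2 * 100   # two 100 Gbps links down to nodes 4 and 5
uplink_gbps = 100         # one 100 Gbps link up to node 1

print(f"{downlink_gbps / uplink_gbps:.0f}:1 oversubscription")  # 2:1
# Fine for bursty traffic, but if both lower nodes transmit upward at line
# rate simultaneously -- as distributed training does -- the uplink saturates.
```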
Amazon EC2 placement groups can influence the placement of EC2 instances in the network topology. A cluster placement strategy packs instances close together inside an Availability Zone. This enables workloads to achieve the low-latency network performance necessary for tightly coupled node-to-node communication. You can verify the placement of your instances to further optimize your ML jobs using Amazon EC2 instance topology. This API allows you to describe your instance topology, providing a hierarchical view of the relative proximity between instances.
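A hedged sketch of both steps (instance IDs are placeholders, and describe_instance_topology requires a recent Boto3 version):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Pack training instances close together in the network.
ec2.create_placement_group(GroupName="training-cluster", Strategy="cluster")
# ...launch instances with Placement={"GroupName": "training-cluster"}...

# Inspect relative proximity: instances sharing more network nodes are closer.
resp = ec2.describe_instance_topology(
    InstanceIds=["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholders
)
for instance in resp["Instances"]:
    print(instance["InstanceId"], instance["NetworkNodes"])
```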
Operating system (OS) bypass
OS bypass is an access model that allows ML applications to communicate directly with the network interface hardware to provide low-latency, reliable transport functionality without the performance tax imposed by the operating system kernel. Amazon EC2 instances commonly used for machine learning implement OS bypass through the Elastic Fabric Adapter (EFA), which we describe later in this section.
Parallelism in data flows
Most networks are highly redundant, meaning there are multiple paths from a given source to a destination (for example, between two EC2 instances). However, a single path is selected for each data flow (the 5-tuple of source IP, destination IP, protocol, source port, and destination port). This has the desirable property that packets are delivered in order to upper-layer protocols such as TCP. On the other hand, because TCP expects ordered packets, a single lost packet delays the arrival of all the packets behind it (an effect called "head-of-line blocking"). Amazon built Scalable Reliable Datagram (SRD) to take advantage of multiple paths in the network at the same time. SRD pushes all the packets making up a block of data at once, over up to 64 paths at a time. In addition, SRD relaxes the requirement for in-order packet delivery, reducing the variability in latency and improving the performance of HPC and ML workloads. SRD is another feature of the EFA.
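The contrast can be illustrated conceptually. This sketch is not the actual SRD algorithm, just a toy model of per-flow hashing versus packet spraying:

```python
import hashlib

def ecmp_path(flow: tuple, num_paths: int) -> int:
    """Classic per-flow routing: hash the 5-tuple once, so every packet of
    a flow follows the same path and arrives in order."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

flow = ("10.0.1.5", "10.0.2.9", "tcp", 49152, 443)
print([ecmp_path(flow, 64) for _ in range(4)])   # same path, four times

# SRD-style spraying (conceptual): packets of one block fan out across paths,
# so one congested or lossy path no longer blocks the whole block.
print([packet % 64 for packet in range(8)])      # eight packets, eight paths
```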
Elastic Fabric Adapter
EFA is an Elastic Network Adapter (ENA) with added capabilities. Its custom-built OS bypass hardware interface and SRD implementation enhance the performance of inter-instance communications. With EFA, ML applications using the NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs. The AWS Deep Learning AMI is ready to use with EFA and comes with the required drivers, kernel modules, and software libraries. Detailed usage instructions can be found in the AWS documentation.
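A minimal smoke test, assuming a Deep Learning AMI instance with EFA, PyTorch, and the aws-ofi-nccl plugin installed, verifies that NCCL collectives work; the NCCL_DEBUG=INFO log output shows whether the EFA transport was selected. Launch it with, for example, torchrun --nproc_per_node=8 nccl_check.py:

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")  # logs reveal the chosen transport

# torchrun supplies RANK, WORLD_SIZE, MASTER_ADDR, and LOCAL_RANK.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Every rank contributes 1; after all-reduce each rank holds the world size.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: {t.item()}")
dist.destroy_process_group()
```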
Amazon EC2 UltraCluster
For enterprises that want to manage their own training infrastructure, Amazon EC2 UltraClusters consist of thousands of accelerated EC2 instances that are colocated in a given Availability Zone and interconnected using EFA networking in a petabit-scale nonblocking network. EC2 UltraClusters also provide access to Amazon FSx for Lustre to quickly process massive datasets on demand and at scale with sub-millisecond latencies.
Amazon SageMaker HyperPod
For customers looking for a managed solution, SageMaker HyperPod drastically reduces the complexity of ML infrastructure management. It offers automated management of the underlying infrastructure, allowing uninterrupted training sessions even in the event of hardware failures. SageMaker HyperPod enables EFA and the NVIDIA NCCL drivers in a customized and tested stack built into its Amazon Machine Image (AMI).
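As a hedged example (cluster name, role ARN, and lifecycle script location are placeholders), a HyperPod cluster is created through the SageMaker CreateCluster API:

```python
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

sagemaker.create_cluster(
    ClusterName="genai-training",
    InstanceGroups=[{
        "InstanceGroupName": "gpu-workers",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 16,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",  # placeholder
            "OnCreate": "on_create.sh",                           # placeholder script
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
        "ThreadsPerCore": 1,
    }],
)
```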
Conclusion
AWS has optimized every part of its networking stack to meet the demanding needs of ML applications. AWS DataSync with AWS Direct Connect offers high-throughput connectivity for dataset migration, and AWS PrivateLink provides secure connections to AWS services and partners from within your Amazon VPC. Elastic Fabric Adapter (EFA) reduces system overhead and uses Scalable Reliable Datagram (SRD) to deliver consistently low latencies. By optimizing the entire networking stack, AWS will continue to lead the way in building the high performance networks that will fuel the next phase of growth in the AI/ML revolution. Learn more about how AWS is engineering infrastructure to power generative AI and start transforming your business with generative AI today.
Hernán is a Principal Networking Specialist TAM for Strategic Accounts based in Portland, Oregon. As a member of both the Networking and AI/ML Technical Field Communities, he helps AWS's biggest customers lay a strong network foundation for generative AI. He has been pushing packets since the 90s and holds 13 active AWS Certifications. In his free time, Hernán enjoys traveling, photography, and reading books that explore the edges between science, art, and philosophy.
Marcos is an AWS Sr. Machine Learning Solutions Architect based in Florida, US. In that role, he guides and assists US generative AI startups in their strategy toward the cloud, providing guidance on how to address high-risk issues and optimize their machine learning workloads. He has more than 25 years of experience with technology, including cloud solution development, machine learning, software development, and data center infrastructure.