Containers
Deploying managed P4d Instances in Amazon Elastic Kubernetes Service with NVIDIA GPUDirect RDMA
In March 2021, Amazon EKS announced support for Amazon EC2 P4d instances, enabling you to launch a fully managed EKS cluster based on the latest NVIDIA A100 GPUs. Amazon EC2 P4d instances are the next generation of GPU-based instances that provide the best performance for machine learning (ML) training and high performance computing (HPC) in the cloud for applications such as natural language processing, object detection and classification, seismic analysis, and genomics research. This post takes you through how you can quickly get started with deploying these instances in a managed EKS cluster.
Product overview
Each p4d.24xlarge instance comes equipped with:
- 8x NVIDIA A100 GPUs
- 96 vCPUs
- 8x 1 TB of local NVMe storage
- 4x 100 Gbps accelerated networking with support for GPUDirect RDMA using Elastic Fabric Adapter (EFA)
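Tallying these per-node figures across a node group is useful when sizing a cluster; a small illustrative helper (the function name is ours, not from any AWS tooling):

```python
def cluster_totals(nodes: int) -> dict:
    """Aggregate p4d.24xlarge resources across a node group,
    using the per-node specs listed above."""
    return {
        "gpus": nodes * 8,                  # 8x NVIDIA A100 per node
        "vcpus": nodes * 96,                # 96 vCPUs per node
        "local_nvme_tb": nodes * 8,         # 8x 1 TB NVMe per node
        "efa_bandwidth_gbps": nodes * 400,  # 4x 100 Gbps per node
    }

# The two-node group deployed later in this post:
print(cluster_totals(2))
# {'gpus': 16, 'vcpus': 192, 'local_nvme_tb': 16, 'efa_bandwidth_gbps': 800}
```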
A more thorough deep dive on the Amazon EC2 P4d instances is available here. Setting up the P4d instances with all the performance optimizations related to GPUDirect RDMA (GDRDMA) and the 400-Gbps networking requires manual steps. By providing this in a managed service layer such as Amazon EKS with managed node groups, this infrastructure setup is handled automatically, so you can focus on running highly scalable distributed accelerated workloads.
Requirements
Install and configure the following components in your local environment.
eksctl – You need version 0.43.0 or later of eksctl.
kubectl – This post uses Kubernetes version 1.19.
You also must set up your environment so that the AWS Command Line Interface (AWS CLI) can authenticate and run commands on your behalf. Install AWS CLI v2 and configure your access key and secret access key.
Deployment
Setting up the cluster is covered in the following steps. In this example, we walk through running the NVIDIA Collective Communication Library (NCCL) tests to validate the use of GPUDirect RDMA over Elastic Fabric Adapter (EFA). The AWS samples GitHub repo for EFA on EKS has additional examples tailored to ML workloads.
Step 1: In your AWS Region, ensure that at least one of the Availability Zones offers P4d instances. You can check availability with the following command:
aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=p4d.24xlarge \
--region us-west-2 \
--output table
Step 2: Copy and paste the following config into your editor and replace any values specific to your Region.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: p4d-cluster
  version: "1.19"
  region: us-west-2
availabilityZones: ["us-west-2b", "us-west-2c"]
iam:
  withOIDC: true
addons:
  - name: vpc-cni
    version: v1.7.10-eksbuild.1
managedNodeGroups:
  - name: p4d-ng-2c
    instanceType: p4d.24xlarge
    instancePrefix: p4d-ng-2c-worker
    privateNetworking: true
    availabilityZones: ["us-west-2c"]
    efaEnabled: true
    minSize: 2
    desiredCapacity: 2
    maxSize: 4
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        cloudWatch: true
This eksctl config file creates a VPC, an EKS cluster, and a P4d managed node group. Also notice the use of EKS add-ons to ensure your cluster launches with at least VPC CNI version 1.7.10, which is a requirement for EFA traffic. The VPC is created with a private and a public subnet in each Availability Zone specified. By specifying private networking and a single Availability Zone in your managed node group, you ensure that your nodes launch in a single subnet, which is a requirement for worker nodes to communicate over EFA. Note that you may need to request a limit increase for your EC2 On-Demand Instance quota; the default is 128 vCPUs for P series instances, and this managed node group can require up to 384 vCPUs (4 p4d.24xlarge instances).
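The quota math above is worth making explicit; a quick sketch (the constants mirror the figures quoted in this post, and the function name is ours):

```python
# Each p4d.24xlarge exposes 96 vCPUs, so the node group's maxSize
# determines the On-Demand vCPU quota needed for the P instance family.
VCPUS_PER_P4D = 96
DEFAULT_P_FAMILY_QUOTA = 128  # default On-Demand vCPU limit for P instances

def required_vcpus(max_size: int, vcpus_per_node: int = VCPUS_PER_P4D) -> int:
    """vCPUs consumed if the node group scales out to max_size nodes."""
    return max_size * vcpus_per_node

needed = required_vcpus(4)  # maxSize: 4 in the config above
print(needed)                           # 384
print(needed > DEFAULT_P_FAMILY_QUOTA)  # True -> request a limit increase
```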
If you have an existing VPC, see this example for how to create a node group with eksctl in a single subnet for an existing VPC. For an existing VPC, ensure that you have the correct networking topology for starting the P4d instances. As a best practice, launch your P4d instances in a private subnet, with a NAT Gateway routing to a public subnet with an Internet Gateway.
Now use the config file to create your cluster and node group:
eksctl create cluster -f p4d-managed-cluster.yaml
This command takes some time, as eksctl will be creating a cluster and P4d node group in sequential steps. In the logs of the eksctl bootstrap command, you should see a log entry confirming that the EFA device plugin was successfully applied.
2021-04-02 15:10:38 [ℹ] created "kube-system:DaemonSet.apps/aws-efa-k8s-device-plugin-daemonset"
2021-04-02 15:10:38 [ℹ] as you have enabled EFA, the EFA device plugin was automatically installed
Once creation is complete, verify that both nodes joined the cluster:

kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-57-3.us-west-2.compute.internal Ready <none> 36h v1.19.6-eks-49a6c0
ip-10-0-72-21.us-west-2.compute.internal Ready <none> 36h v1.19.6-eks-49a6c0
Step 3: Apply the latest version of the NVIDIA Kubernetes device plugin.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
Describe one of the nodes by calling kubectl describe node ip-10-0-57-3.us-west-2.compute.internal, and you can see the allocatable resources:
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 96
ephemeral-storage: 83873772Ki
hugepages-1Gi: 0
hugepages-2Mi: 10562Mi
memory: 1176329072Ki
nvidia.com/gpu: 8
pods: 737
vpc.amazonaws.com/efa: 4
By using eksctl and managed node groups, all the heavy lifting of configuring the infrastructure and networking for EFA with GDRDMA is automatically handled. This includes installing the EFA device plugin, which presents the EFA network devices as allocatable resources to pods via the vpc.amazonaws.com/efa Kubernetes extended resource. Additionally, with the efaEnabled flag, eksctl automatically handles other EFA prerequisites, including creating an EFA-enabled security group and an EC2 placement group, and installing the EFA driver as part of the EC2 user data. You can find more details on these steps in the EKS documentation. Next, let's run the NCCL test to validate our training job throughput.
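A pod consumes these extended resources by requesting them in its resource limits, just like CPU or memory. A minimal illustrative sketch (the pod name and image are placeholders, not from the sample repo; the NCCL test manifest in the next step makes equivalent requests for you via the MPI Operator):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: efa-gpu-worker                      # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: my-registry/nccl-tests:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8                 # all 8 A100s on a p4d.24xlarge
          vpc.amazonaws.com/efa: 4          # all 4 EFA devices on the node
```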
Step 4: Example Benchmarking
With the base EKS cluster in place, you can then add the Kubeflow MPI Operator for your subsequent tests.
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1alpha2/mpi-operator.yaml
Next, clone the aws-samples/aws-efa-eks repo and apply the test configuration:
git clone https://github.com/aws-samples/aws-efa-eks
cd aws-efa-eks/examples
kubectl apply -f nccl-efa-tests.yaml
Once the pods start up and are in the Running state, check the logs:
kubectl get pods
kubectl logs -l=mpi_role_type=launcher --tail=-1
You can see that NCCL calls libfabric and uses the underlying EFA devices and GPUDirect RDMA. Here is the expected output:
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO NET/OFI Selected Provider is efa
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO Using network AWS Libfabric
[1,0]<stdout>:NCCL version 2.8.3+cuda11.2
...
[1,4]<stdout>:nccl-tests-efa-worker-0:30:70 [4] NCCL INFO Channel 06 : 13[901d0] -> 4[901c0] [receive] via NET/AWS Libfabric/2/GDRDMA
[1,12]<stdout>:nccl-tests-efa-worker-1:29:72 [4] NCCL INFO Channel 06 : 5[901d0] -> 12[901c0] [receive] via NET/AWS Libfabric/2/GDRDMA
[1,10]<stdout>:nccl-tests-efa-worker-1:27:65 [2] NCCL INFO Channel 05 : 3[201d0] -> 10[201c0] [receive] via NET/AWS Libfabric/1/GDRDMA
[1,2]<stdout>:nccl-tests-efa-worker-0:28:69 [2] NCCL INFO Channel 05 : 11[201d0] -> 2[201c0] [receive] via NET/AWS Libfabric/1/GDRDMA
[1,8]<stdout>:nccl-tests-efa-worker-1:25:69 [0] NCCL INFO Channel 04 : 1[101d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0/GDRDMA
[1,8]<stdout>:nccl-tests-efa-worker-1:25:69 [0] NCCL INFO Channel 00 : 8[101c0] -> 15[a01d0] via P2P/IPC/read
[1,0]<stdout>:nccl-tests-efa-worker-0:26:67 [0] NCCL INFO Channel 04 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0/GDRDMA
...
[1,0]<stdout>:#
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# size count type redop time algbw busbw error time algbw busbw error
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>:nccl-tests-efa-worker-0:26:26 [0] NCCL INFO Launch mode Parallel
[1,1]<stdout>:nccl-tests-efa-worker-0:27:72 [1] NCCL INFO comm 0x7fece4000dc0 rank 1 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
[1,0]<stdout>: 8 2 float sum 167.5 0.00 0.00 2e-07 167.1 0.00 0.00 1e-07
[1,0]<stdout>: 16 4 float sum 167.3 0.00 0.00 1e-07 167.4 0.00 0.00 1e-07
[1,0]<stdout>: 32 8 float sum 167.9 0.00 0.00 1e-07 167.5 0.00 0.00 1e-07
[1,0]<stdout>: 64 16 float sum 167.7 0.00 0.00 1e-07 167.7 0.00 0.00 6e-08
[1,0]<stdout>: 128 32 float sum 168.0 0.00 0.00 6e-08 167.9 0.00 0.00 6e-08
[1,0]<stdout>: 256 64 float sum 168.6 0.00 0.00 6e-08 168.9 0.00 0.00 6e-08
[1,0]<stdout>: 512 128 float sum 374.7 0.00 0.00 6e-08 170.1 0.00 0.01 6e-08
[1,0]<stdout>: 1024 256 float sum 182.5 0.01 0.01 5e-07 182.3 0.01 0.01 5e-07
[1,0]<stdout>: 2048 512 float sum 205.0 0.01 0.02 5e-07 205.0 0.01 0.02 5e-07
[1,0]<stdout>: 4096 1024 float sum 233.3 0.02 0.03 5e-07 234.4 0.02 0.03 5e-07
[1,0]<stdout>: 8192 2048 float sum 250.5 0.03 0.06 5e-07 249.5 0.03 0.06 5e-07
[1,0]<stdout>: 16384 4096 float sum 254.2 0.06 0.12 5e-07 253.9 0.06 0.12 5e-07
[1,0]<stdout>: 32768 8192 float sum 260.1 0.13 0.24 5e-07 259.7 0.13 0.24 5e-07
[1,0]<stdout>: 65536 16384 float sum 273.9 0.24 0.45 5e-07 273.8 0.24 0.45 5e-07
[1,0]<stdout>: 131072 32768 float sum 294.2 0.45 0.84 5e-07 294.2 0.45 0.84 5e-07
[1,0]<stdout>: 262144 65536 float sum 304.9 0.86 1.61 5e-07 305.5 0.86 1.61 5e-07
[1,0]<stdout>: 524288 131072 float sum 409.7 1.28 2.40 5e-07 410.3 1.28 2.40 5e-07
[1,0]<stdout>: 1048576 262144 float sum 483.5 2.17 4.07 5e-07 483.6 2.17 4.07 5e-07
[1,0]<stdout>: 2097152 524288 float sum 660.3 3.18 5.95 5e-07 672.4 3.12 5.85 5e-07
[1,0]<stdout>: 4194304 1048576 float sum 817.0 5.13 9.63 5e-07 817.0 5.13 9.63 5e-07
[1,0]<stdout>: 8388608 2097152 float sum 1228.0 6.83 12.81 5e-07 1223.6 6.86 12.85 5e-07
[1,0]<stdout>: 16777216 4194304 float sum 1895.5 8.85 16.60 5e-07 1900.9 8.83 16.55 5e-07
[1,0]<stdout>: 33554432 8388608 float sum 3106.8 10.80 20.25 5e-07 3104.1 10.81 20.27 5e-07
[1,0]<stdout>: 67108864 16777216 float sum 5567.2 12.05 22.60 5e-07 5566.4 12.06 22.61 5e-07
[1,0]<stdout>: 134217728 33554432 float sum 9388.6 14.30 26.80 5e-07 9343.3 14.37 26.93 5e-07
[1,0]<stdout>: 268435456 67108864 float sum 16865 15.92 29.84 5e-07 16853 15.93 29.86 5e-07
[1,0]<stdout>: 536870912 134217728 float sum 32206 16.67 31.26 5e-07 32151 16.70 31.31 5e-07
[1,0]<stdout>: 1073741824 268435456 float sum 61556 17.44 32.71 5e-07 61303 17.52 32.84 5e-07
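In this output, busbw scales algbw by the all-reduce correction factor 2(n-1)/n, as described in the nccl-tests performance documentation; a quick sketch for the 16-GPU run above (the function name is ours):

```python
def allreduce_busbw(algbw_gbps: float, nranks: int) -> float:
    """Bus bandwidth from algorithm bandwidth for an all-reduce:
    busbw = algbw * 2*(n-1)/n, per the nccl-tests performance docs."""
    return algbw_gbps * 2 * (nranks - 1) / nranks

# Largest message size in the table above: algbw 17.44 GB/s across 16 GPUs.
print(round(allreduce_busbw(17.44, 16), 2))  # 32.7, matching the busbw column
```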
Step 5: Cleanup
To clean up the environment, you can delete the entire cluster and node group with the following command:
eksctl delete cluster --name p4d-cluster --region us-west-2 --wait
Conclusion
In this post, you learned how to get started with deploying machine learning applications that take full advantage of P4d instances on EKS. By using eksctl with managed node groups, all of the infrastructure setup required for managed, elastic scaling of P4d instances with GPUDirect RDMA over EFA is completely automated. You also saw how the NCCL tests ran an all-reduce job across all 16 GPUs and measured network bandwidth across the two-node cluster. At AWS, we have already seen several EKS customers move to P4d and reduce their time to complete distributed ML training by nearly 50%, and we are excited to see what improvements you will experience, in addition to the new types of machine learning this capability unlocks. As always, feel free to leave feedback and comments on either the AWS sample repository or the AWS Containers roadmap.