Utilizing NVIDIA Multi-Instance GPU (MIG) in Amazon EC2 P4d Instances on Amazon Elastic Kubernetes Service (EKS)
In November 2020, AWS released the Amazon EC2 P4d instances, which deliver the highest performance for machine learning (ML) training and high performance computing (HPC) applications in the cloud. Each instance comes with the following characteristics:
- Eight NVIDIA A100 Tensor core GPUs
- 96 vCPUs
- 1 TB of RAM
- 400 Gbps Elastic Fabric Adapter (EFA) with support for GPUDirect RDMA
One of the primary benefits of AWS is elasticity. You can elastically scale workloads according to demand, where increased compute utilization triggers additional scale. With P4d instances, you can now also reshape compute resources by partitioning NVIDIA GPUs into slices for various workloads, a feature called Multi-Instance GPU (MIG).
With MIG, you can partition the GPU into slices with dedicated streaming multiprocessor (SM) isolation based on different memory profiles. With this option, you can dispatch multiple diverse workloads, which do not require the memory footprint of a whole GPU, on the same GPU without performance interference.
Scheduling workloads on these slices concurrently, while elastically scaling the nodes through Amazon EC2 Auto Scaling, allows you to reshape scaled compute. With MIG, EC2 P4d instances can be used for scalable mixed-topology workloads. This post walks through an example of running an ML inferencing workload with and without MIG on Amazon Elastic Kubernetes Service (Amazon EKS).
MIG Profiles
Different MIG profiles exist for each GPU in the P4d instance. Recall that each p4d.24xlarge comes with eight NVIDIA A100 GPUs, and each A100 can be partitioned into up to seven 5 GB slices. This means you can have up to 56 accelerators per node. By shepherding requests across all 56 GPU slices, you can run many diverse workloads per node. The following table shows the available profiles per A100 GPU.
$ sudo nvidia-smi mig -lgip
+--------------------------------------------------------------------------+
| GPU instance profiles:                                                   |
| GPU   Name          ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                           Free/Total   GiB              CE    JPEG  OFA  |
|==========================================================================|
|   0  MIG 1g.5gb     19     7/7         4.95       No     14     0     0  |
|                                                            1     0     0  |
+--------------------------------------------------------------------------+
|   0  MIG 2g.10gb    14     3/3         9.90       No     28     1     0  |
|                                                            2     0     0  |
+--------------------------------------------------------------------------+
|   0  MIG 3g.20gb     9     2/2        19.79       No     42     2     0  |
|                                                            3     0     0  |
+--------------------------------------------------------------------------+
|   0  MIG 4g.20gb     5     1/1        19.79       No     56     2     0  |
|                                                            4     0     0  |
+--------------------------------------------------------------------------+
|   0  MIG 7g.40gb     0     1/1        39.59       No     98     5     0  |
|                                                            7     1     1  |
+--------------------------------------------------------------------------+
As an added feature, you can mix multiple profiles per GPU for further reshaping and scheduling. For the rest of this post, I refer to MIG profiles by their profile ID (the third column in the preceding table) for simplicity.
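For example, here is a hedged sketch of a mixed layout on a single A100 (the exact combination must be one of the placements the A100 supports): it creates one 3g.20gb, one 2g.10gb, and one 1g.5gb GPU instance on GPU 0, along with their compute instances, and then lists the result.
# Mixed-profile sketch: one 3g.20gb (ID 9), one 2g.10gb (ID 14), and one 1g.5gb (ID 19)
# GPU instance on GPU 0, with matching compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 9,14,19 -C

# Confirm the resulting layout
sudo nvidia-smi mig -i 0 -lgi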
Deployment on EKS
Amazon EC2 P4d instances are supported on Amazon EKS, so it is possible, through some configuration changes, to deploy a nodegroup on which to schedule jobs. In the example here, I use Argo Workflows on top of EKS with MIG to show how you can quickly run DAG workflows that use MIG slices in the backend. The configuration changes can be found in the aws-samples/aws-efa-nccl-baseami-pipeline GitHub repository. The repository uses Packer; if you build the components of the Packer script and save the resulting Amazon Machine Image (AMI), these changes are available by default.
Step 1. Start an EKS cluster with the following command:
eksctl create cluster --name=${cluster_name} \
--region=us-west-2 \
--ssh-access --ssh-public-key ~/.ssh/id_rsa.pub \
--without-nodegroup
Step 2. Next, create a managed nodegroup with a P4d node:
eksctl create nodegroup --cluster ${cluster_name} \
--name p4d-mig --nodes 1 --ssh-access \
--instance-types p4d.24xlarge \
--full-ecr-access --managed
It is important to note that MIG is disabled by default when launching a P4d instance. In the AMI, a systemd service enables MIG and sets up a default partition scheme. The following code is that systemd unit file, which starts before the nvidia-fabricmanager service unit in the systemd chain.
[Unit]
Description=Create a default MIG configuration
Before=nvidia-fabricmanager.service
Requires=nvidia-persistenced.service
After=nvidia-persistenced.service
[Service]
Type=oneshot
EnvironmentFile=/etc/default/mig
RemainAfterExit=yes
ExecStartPre=/bin/nvidia-smi -mig 1
ExecStart=/opt/mig/create_mig.sh $MIG_PARTITION
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
The environment file /etc/default/mig defines the $MIG_PARTITION that is used in the script /opt/mig/create_mig.sh.
#!/bin/bash -xe
nvidia-smi mig -cgi $1 -C
This value is set by user data in the AWS launch template (LT). You can iterate over launch template versions to create LTs with different MIG partition profiles. The following example creates seven slices of the 5 GB A100 profile (profile ID 19) on each GPU.
--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -o xtrace
echo -e "MIG_PARTITION=19,19,19,19,19,19,19" >> /etc/default/mig
systemctl start aws-gpu-mig.service
--==BOUNDARY==
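To check that the default partitioning took effect after a node boots, you can inspect the service and list the MIG devices on the instance (a quick verification sketch; the service name comes from the user data above):
sudo systemctl status aws-gpu-mig.service   # confirm the oneshot service ran
nvidia-smi -L                               # lists each GPU and its MIG devices
sudo nvidia-smi mig -lgi                    # lists the created GPU instances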
Step 3. Once the EKS cluster is running and the nodegroup is created with the nodes in Ready state, you can install the NVIDIA device plugin and GPU feature discovery components through Helm.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update
Now, verify the repos and check that the latest versions of the nvidia-device-plugin and gpu-feature-discovery charts are available.
helm search repo nvdp --devel
NAME                         CHART VERSION  APP VERSION  DESCRIPTION
nvdp/nvidia-device-plugin    0.8.2          0.8.2        A Helm chart for the nvidia...

helm search repo nvgfd --devel
NAME                         CHART VERSION  APP VERSION  DESCRIPTION
nvgfd/gpu-feature-discovery  0.4.1          0.4.1        A Helm chart for gpu-feature-...
You can set the MIG strategy to mixed, which allows you to address each individual MIG GPU slice. Set the MIG_STRATEGY environment variable and install the plugins.
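For example, in a bash shell, export the strategy before running the Helm installs below:
export MIG_STRATEGY=mixed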
helm install --version=0.8.2 --generate-name --set migStrategy=${MIG_STRATEGY} nvdp/nvidia-device-plugin
helm install --version=0.4.1 --generate-name --set migStrategy=${MIG_STRATEGY} nvgfd/gpu-feature-discovery
Step 4. After a few minutes, kubectl describe node should report the 56 GPU slices, which can be used for allocation.
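A hedged example of querying the P4d node directly (the label selector assumes the standard node.kubernetes.io/instance-type label on EKS nodes):
kubectl describe nodes -l node.kubernetes.io/instance-type=p4d.24xlarge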
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         96
  ephemeral-storage:           104845292Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               10562Mi
  memory:                      1176334124Ki
  nvidia.com/mig-1g.5gb:       56
  pods:                        737
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         95690m
  ephemeral-storage:           95551679124
  hugepages-1Gi:               0
  hugepages-2Mi:               10562Mi
  memory:                      1156853548Ki
  nvidia.com/mig-1g.5gb:       56
  pods:                        737
Step 5. Argo Deployment and Testing
With the base cluster in place, you can deploy Argo Workflows and run through a few tests. Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes. In this example, I use the idealo image-super-resolution workload, an ML inferencing example that performs GAN-based image upscaling.
kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo-workflows/stable/manifests/install.yaml
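Before submitting a workflow, you can confirm that the Argo pods are running; the argo CLI used in Step 6 can be installed from the argoproj/argo-workflows releases page.
kubectl get pods -n argo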
After deploying the Argo Workflows components, you can submit the example workflow below. This directed acyclic graph (DAG) expands a loop that launches a variable number of ML upscaling jobs and schedules each of them on its own MIG slice.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: super-resolution-example-
spec:
  entrypoint: super-resolution-result-example
  templates:
  - name: super-resolution-result-example
    steps:
    - - name: generate
        template: gen-number-list
    # Iterate over the list of numbers generated by the preceding generate step
    - - name: super-resolution-mig
        template: super-resolution-mig
        arguments:
          parameters:
          - name: super-resolution-mig
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"
  # Generate a list of numbers in JSON format
  - name: gen-number-list
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(0, 56)], sys.stdout)
  - name: super-resolution-mig
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
    inputs:
      parameters:
      - name: super-resolution-mig
    container:
      image: 231748552833.dkr.ecr.us-east-1.amazonaws.com/super-res-gpu:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
      workingDir: /root/image-super-resolution
      command: ["python"]
      args: ["super-resolution-predict.py"]
This workflow includes a resource limit that tells the Kubernetes scheduler to place each pod onto an instance that can fulfill the request (that is, the P4d) and to allocate a single 5 GB MIG slice to each super-resolution pod.
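The same scheduling behavior applies outside of Argo. The following is a minimal sketch of a plain pod requesting one MIG slice (the image is a placeholder; any CUDA-enabled image works):
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-test
    image: nvidia/cuda:11.0-base      # placeholder CUDA base image
    command: ["nvidia-smi", "-L"]     # prints the single MIG device visible to the pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # request exactly one 1g.5gb MIG slice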
Step 6. Submit the job.
argo submit super-res-5g.argo --watch
The loop expands into one job per member of the range. You can see that all 56 GPU slices are allocated when using kubectl describe node, as shown in the following code block:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         310m (0%)   0 (0%)
  memory                      140Mi (0%)  340Mi (0%)
  ephemeral-storage           0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
  nvidia.com/mig-1g.5gb       56          56
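While the workflow runs, you can also watch the pods fan out across the MIG slices:
kubectl get pods --watch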
Check the status of the workflow with the Argo CLI.
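For example (the workflow name is auto-generated from generateName, so list the workflows first):
argo list                   # shows the generated workflow name
argo get <workflow-name>    # substitute the name reported by argo list
The output reports the workflow status and timing: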
ServiceAccount: default
Status: Succeeded
Conditions:
Completed True
Created: Tue Jan 05 13:57:35 -0500
Started: Tue Jan 05 13:57:35 -0500
Finished: Tue Jan 05 13:59:02 -0500
Duration: 1 minute 27 seconds
With the workflow and job overhead, all 56 jobs complete in about 1 minute 27 seconds. By comparison, whole-GPU allocation through the nvidia-k8s-plugin processes the same workflow in about four minutes. This is because each full GPU is allocated to a single job and blocked from scheduling further jobs until the eight running jobs complete, regardless of whether the full GPU is utilized, which highlights one of the benefits of MIG.
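For reference, the whole-GPU comparison only changes the resource limit in the workflow's container template; a sketch of that stanza, assuming MIG is disabled on the node so that full GPUs are advertised as nvidia.com/gpu:
      resources:
        limits:
          nvidia.com/gpu: 1      # one whole A100 per pod instead of a 1g.5gb slice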
Cleanup
To clean up the deployment, use eksctl to delete the cluster:
eksctl delete cluster --name <cluster-name>
Conclusion
With NVIDIA Multi-Instance GPU (MIG) on P4d instances on Amazon Elastic Kubernetes Service, it's now possible to execute large-scale, disparate inferencing workloads that handle multiple requests from a single endpoint. With MIG on P4d, you can have up to 56 individual accelerators per instance, improving utilization in a multiuser and/or multi-request architecture. We're excited to see what our customers build with MIG on P4d.