Containers

Run GenAI inference across environments with Amazon EKS Hybrid Nodes

This blog post was authored by Robert Northard, Principal Container Specialist SA, Eric Chapman, Senior Product Manager EKS, and Elamaran Shanmugam, Senior Specialist Partner SA.

Introduction

Amazon Elastic Kubernetes Service (Amazon EKS) Hybrid Nodes transform how you run generative AI inference workloads across cloud and on-premises environments. Extending your EKS cluster to on-premises infrastructure allows you to deploy AI applications with consistent management and reduced operational complexity. Amazon EKS provides a managed Kubernetes control plane, and EKS Hybrid Nodes enables you to join on-premises infrastructure to the Amazon EKS control plane as worker nodes, eliminating the need to manage a Kubernetes control plane for on-premises deployments. EKS Hybrid Nodes also allows you to run cloud and on-premises capacity together in a single EKS cluster.

EKS Hybrid Nodes enable various AI/machine learning (ML) use cases and architectures, such as the following:

  • Run services closer to users to support latency-sensitive workloads, including real-time inference at the edge.
  • Train models with data that must stay on-premises due to data residency requirements.
  • Run inference workloads closer to source data, such as RAG applications using your knowledge base.
  • Use elasticity of AWS Cloud for more compute resources during peak demand.
  • Use existing on-premises hardware.

This post describes a proof of concept for using a single EKS cluster to run AI inference on-premises with EKS Hybrid Nodes and in the AWS Cloud with Amazon EKS Auto Mode. EKS Auto Mode fully automates Kubernetes cluster management for compute, storage, and networking. Learn more about EKS Auto Mode in the Amazon EKS user guide.

Solution overview

For our example inference workload, we deploy a model through NVIDIA NIM. NVIDIA NIMs are microservices optimized by NVIDIA for running AI models on GPUs. We create an EKS cluster enabled for both EKS Hybrid Nodes and EKS Auto Mode, then join our on-premises machines to the cluster as hybrid nodes. For our on-premises deployment, we install the NVIDIA drivers and the NVIDIA device plugin for Kubernetes before deploying the model to EKS Hybrid Nodes. Finally, we deploy the model to EKS Auto Mode nodes, which come preconfigured with the drivers needed for both NVIDIA GPU-based and AWS Neuron-based instances. This walkthrough doesn’t include steps for establishing the hybrid networking and authentication prerequisites for running EKS Hybrid Nodes, which can be found in the Amazon EKS user guide.

Figure 1: A diagram providing a high-level overview of an EKS cluster with both EKS Hybrid Nodes and EKS nodes in-Region.

The preceding figure presents a high-level diagram of the architecture we use in our walkthrough. The Amazon Virtual Private Cloud (VPC) has two public subnets and two private subnets that host the EKS Auto Mode worker nodes. Communication between the control plane and EKS Hybrid Nodes routes through the VPC, across a Transit Gateway or Virtual Private Gateway, and over a private network connection. EKS Hybrid Nodes need reliable network connectivity between the on-premises environment and the AWS Region, which can be established with AWS Site-to-Site VPN, AWS Direct Connect, or a user-managed VPN solution. Routing tables, security groups, and firewall rules must be configured to allow bidirectional communication between the environments.
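
To make the firewall requirements concrete, the following sketch opens HTTPS from the remote node and pod networks to the additional control plane security group, using placeholder values that match the cluster configuration later in this walkthrough. Treat it as an illustration only; follow the EKS Hybrid Nodes networking documentation for the complete set of required ports and routes.

# Placeholder values matching the remoteNetworkConfig and security group used
# later in this post.
REMOTE_NODE_CIDR="172.18.0.0/16"
REMOTE_POD_CIDR="172.17.0.0/16"
CONTROL_PLANE_SG="ADDITIONAL_CONTROL_PLANE_SECURITY_GROUP_ID"

# Allow hybrid nodes and pods to reach the EKS control plane over HTTPS (443).
for CIDR in "$REMOTE_NODE_CIDR" "$REMOTE_POD_CIDR"; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$CONTROL_PLANE_SG" \
    --protocol tcp --port 443 \
    --cidr "$CIDR"
done
Bash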

Prerequisites

The following prerequisites are necessary to complete this solution:

  • An AWS account with the AWS CLI, eksctl, kubectl, and Helm installed and configured in your working environment.
  • Hybrid network connectivity and IAM credential prerequisites for EKS Hybrid Nodes, as described in the Amazon EKS user guide.
  • On-premises machines with NVIDIA GPUs, with the NVIDIA drivers and NVIDIA Container Toolkit installed.
  • An NVIDIA NGC API key for pulling NVIDIA NIM container images.

Walkthrough

The following steps walk you through this solution.

Creating an EKS Hybrid Nodes and EKS Auto Mode enabled cluster

We use eksctl, a CLI tool for creating and managing clusters on Amazon EKS, to create an EKS cluster enabled for EKS Hybrid Nodes and EKS Auto Mode.

  1. Create a ClusterConfig file, cluster-configuration.yaml. This file includes the autoModeConfig that enables EKS Auto Mode and the remoteNetworkConfig that enables EKS Hybrid Nodes. For more information about valid remoteNetworkConfig values, see Create cluster in the EKS Hybrid Nodes documentation.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: hybrid-eks-cluster
  region: us-west-2
  version: "KUBERNETES_VERSION"
# Disable default networking add-ons as EKS Auto Mode
# comes integrated with VPC CNI, kube-proxy, and CoreDNS
addonsConfig:
  disableDefaultAddons: true

vpc:
  subnets:
    public:
      public-one: { id: "PUBLIC_SUBNET_ID_1" }
      public-two: { id: "PUBLIC_SUBNET_ID_2" }
    private:
      private-one: { id: "PRIVATE_SUBNET_ID_1" }
      private-two: { id: "PRIVATE_SUBNET_ID_2" }

  controlPlaneSubnetIDs: ["PRIVATE_SUBNET_ID_1", "PRIVATE_SUBNET_ID_2"]
  controlPlaneSecurityGroupIDs: ["ADDITIONAL_CONTROL_PLANE_SECURITY_GROUP_ID"]

autoModeConfig:
  enabled: true
  nodePools: ["system"]
  
remoteNetworkConfig:
  # Either ssm or ira
  iam:
    provider: ssm
  # Required
  remoteNodeNetworks:
  - cidrs: ["172.18.0.0/16"]
  # Optional
  remotePodNetworks:
  - cidrs: ["172.17.0.0/16"]
YAML
  2. After creating the ClusterConfig file, create the EKS cluster by running the following command:
eksctl create cluster -f cluster-configuration.yaml
Bash
  3. Wait for the cluster status to become ACTIVE. You can check the status with the following command:
aws eks describe-cluster \
    --name hybrid-eks-cluster \
    --output json \
    --query 'cluster.status'
Bash

Preparing hybrid nodes

1. EKS Hybrid Nodes need kube-proxy and CoreDNS. Install the add-ons by running the following AWS CLI commands. EKS Hybrid Nodes automatically receive the label eks.amazonaws.com/compute-type: hybrid, which can be used to target workloads toward or away from hybrid nodes. To learn more about deploying Amazon EKS add-ons with EKS Hybrid Nodes, see Configure add-ons for hybrid nodes.

aws eks create-addon --cluster-name hybrid-eks-cluster --addon-name kube-proxy
aws eks create-addon --cluster-name hybrid-eks-cluster --addon-name coredns
Bash

If you run at least one replica of CoreDNS in the AWS Cloud, then you must allow DNS traffic to the VPC and nodes where CoreDNS is running. Furthermore, your on-premises remote pod CIDR must be routable from the nodes in your Amazon VPC. See the EKS Hybrid Nodes user guide for guidance on running mixed-mode clusters.
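
As a concrete illustration, the following sketch assumes a placeholder cluster security group ID and the remote pod CIDR from the earlier remoteNetworkConfig, and allows DNS queries from on-premises pods to reach CoreDNS replicas running on in-Region nodes. Your actual rules depend on how you run CoreDNS and on your network design.

# Hypothetical values: the cluster security group attached to in-Region nodes
# and the remote pod CIDR from remoteNetworkConfig.
CLUSTER_SG="CLUSTER_SECURITY_GROUP_ID"
REMOTE_POD_CIDR="172.17.0.0/16"

# Allow DNS (TCP and UDP 53) from on-premises pods to in-Region CoreDNS replicas.
for PROTO in tcp udp; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$CLUSTER_SG" \
    --protocol "$PROTO" --port 53 \
    --cidr "$REMOTE_POD_CIDR"
done
Bash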

2. You can join your on-premises nodes to the Amazon EKS control plane as EKS Hybrid Nodes. To do so, install nodeadm, the EKS Hybrid Nodes CLI, which installs and configures the components needed to turn your machines into EKS worker nodes: the kubelet, containerd, and the aws-iam-authenticator. To install nodeadm on your machines and join your nodes to the cluster, follow the steps in the EKS Hybrid Nodes documentation at Connect hybrid nodes. Before running workloads on hybrid nodes, install a compatible Container Network Interface (CNI) plugin. Follow Configure a CNI for hybrid nodes for steps to set up a CNI with EKS Hybrid Nodes.
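
For reference, joining a machine generally comes down to two nodeadm commands, shown here as a hedged sketch with SSM credentials assumed; check the Connect hybrid nodes documentation for the exact syntax and options for your Kubernetes version and credential provider.

# Run as root on each on-premises machine. KUBERNETES_VERSION must match the
# cluster version, and nodeConfig.yaml is the NodeConfig file for this node
# (an example with kubelet customizations follows the next paragraph).
nodeadm install KUBERNETES_VERSION --credential-provider ssm
nodeadm init --config-source file://nodeConfig.yaml
Bash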

When registering nodes, you can modify the kubelet configuration to add node labels or taints, for example topology.kubernetes.io/zone, to specify which zone the hybrid nodes are in. You can also add labels that represent the capabilities of the attached GPUs to influence workload scheduling. For EKS Hybrid Nodes capacity with a mix of GPU and non-GPU machines, it’s recommended that you add a --register-with-taints=nvidia.com/gpu=Exists:NoSchedule taint to GPU nodes, so that non-GPU workloads (such as CoreDNS) aren’t scheduled on GPU nodes. Review the Hybrid Nodes documentation for how to modify the kubelet configuration when using nodeadm.
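
As a minimal sketch of what that can look like, the following NodeConfig adds a zone label and the GPU taint through kubelet flags. The SSM activation values and the zone name are placeholders, and the exact schema supported by your nodeadm version is documented in the Hybrid Nodes user guide.

cat > nodeConfig.yaml <<EOF
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: hybrid-eks-cluster
    region: us-west-2
  hybrid:
    ssm:
      activationCode: SSM_ACTIVATION_CODE
      activationId: SSM_ACTIVATION_ID
  kubelet:
    flags:
      # Label the node with its on-premises zone and taint GPU nodes so that
      # non-GPU workloads (such as CoreDNS) aren't scheduled on them.
      - --node-labels=topology.kubernetes.io/zone=onprem-zone-1
      - --register-with-taints=nvidia.com/gpu=Exists:NoSchedule
EOF
Bash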

3. Validate that your nodes are connected and in a Ready state by running the following kubectl command. You must install a CNI for hybrid nodes to become Ready.

> kubectl get nodes -o wide -l eks.amazonaws.com/compute-type=hybrid

NAME                   STATUS   ROLES    AGE  VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
mi-1111111111111111    Ready    <none>   5d   v1.xx.x.  10.80.146.76   <none>        Ubuntu 22.04.4 LTS   5.15.0-131-generic   containerd://1.7.12
mi-2222222222222222    Ready    <none>   5d   v1.xx.x.  10.80.146.28   <none>        Ubuntu 22.04.4 LTS   5.15.0-131-generic   containerd://1.7.12
Bash

Installing NVIDIA device plugin for Kubernetes

This section assumes that your on-premises EKS Hybrid Nodes have the necessary NVIDIA drivers and NVIDIA Container Toolkit configured. Kubernetes device plugins advertise system hardware, such as GPUs, to the kubelet. Because this walkthrough uses NVIDIA GPUs, we must install the NVIDIA device plugin for Kubernetes to expose GPU devices to the Kubernetes scheduler. If the NVIDIA drivers and NVIDIA Container Toolkit aren’t included in your machine images and configured so that containerd can use the NVIDIA Container Runtime, then you can instead deploy the NVIDIA GPU Operator, which installs these components, along with the NVIDIA device plugin, at runtime.
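
If you take the GPU Operator route instead, an install along the following lines is a starting point (the release name and namespace are illustrative); review the chart values to constrain its DaemonSets to your GPU-equipped hybrid nodes before installing.

# Add NVIDIA's Helm repository and install the GPU Operator into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
Bash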

1. To install the NVIDIA device plugin using kubectl, first download the deployment manifest:

curl -fsSL -O https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
Bash

Review the NVIDIA Device plugin GitHub repository for the latest versions.

2. You don’t need to install the NVIDIA device plugin on EKS Auto Mode nodes; the device plugin DaemonSet should run only on hybrid nodes that have GPUs. Update the NVIDIA device plugin manifest to target hybrid nodes by adding the label eks.amazonaws.com/compute-type: hybrid to .spec.template.spec.nodeSelector, and add any further labels if you have a mix of GPU and non-GPU worker nodes:

nodeSelector:
  eks.amazonaws.com/compute-type: hybrid
YAML

3. Install the NVIDIA Device plugin by applying the manifest:

kubectl apply -f nvidia-device-plugin.yml
Bash

4. Use the following command to validate that the NVIDIA device plugin pods are running:

kubectl get pods -n kube-system
Bash

You should see output similar to the following when listing Pods in kube-system for the NVIDIA device plugin. The DaemonSet should be scheduled only on nodes with a GPU:

NAMESPACE     NAME                                   READY   STATUS
kube-system   nvidia-device-plugin-daemonset-mb8hw   1/1     Running
kube-system   nvidia-device-plugin-daemonset-vtz8h   1/1     Running
Bash

5. You can check whether the GPUs are exposed to the kubelet by verifying that they appear in the nodes’ allocatable resources:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" -l eks.amazonaws.com/compute-type=hybrid
Bash

The following shows the allocatable GPU count you would expect to see for nodes with a GPU attached:

NAME                   GPU
mi-11111111111111111   1
mi-22222222222222222   1
Bash

Deploying NVIDIA NIM for inference on EKS Hybrid Nodes

1. Before deploying NVIDIA NIM, create the container registry and NGC API key secrets that NIM requires. Set the NGC_API_KEY environment variable to your NVIDIA NGC API key before running the following commands:

kubectl create secret docker-registry nvcrio-cred --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY
Bash

2. Clone the NIM helm chart by running the following command:

 git clone https://github.com/NVIDIA/nim-deploy.git
 cd nim-deploy/helm
Bash

3. Create the Helm chart overrides file. Set the nodeSelector to target your hybrid nodes:

cat > nim.values.yaml <<EOF
image:
    repository: "nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3"
    tag: latest
model:
  ngcAPISecret: ngc-api
nodeSelector:
  eks.amazonaws.com/compute-type: hybrid
resources:
  limits:
    nvidia.com/gpu: 1
persistence:
  enabled: false
imagePullSecrets:
  - name: nvcrio-cred
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
EOF
Bash

You can modify the image repository in the values file to deploy a different model.

4. Install the NIM Helm chart by running the following command:

helm install nim nim-llm/ -f ./nim.values.yaml
Bash

This deployment doesn’t use a model cache. Consider using a model cache to speed up application initialization during scaling events; implementing one requires appropriate CSI drivers and storage infrastructure.
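
As a hedged sketch of what that could look like with the NIM Helm chart, the following overlays persistence settings in a separate values file. The key names depend on the chart version, and STORAGE_CLASS_NAME is a placeholder that must be backed by a CSI driver reachable from your hybrid nodes.

cat > nim.cache.values.yaml <<EOF
persistence:
  enabled: true
  storageClass: "STORAGE_CLASS_NAME"
  size: 50Gi
EOF

# Later values files override earlier ones, so the cache settings take effect.
helm upgrade nim nim-llm/ -f ./nim.values.yaml -f ./nim.cache.values.yaml
Bash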

Testing NIM with example prompts

1. To test the NIM microservice, create a Kubernetes port-forward to the NIM service:

kubectl port-forward service/nim-nim-llm 8000:8000
Bash

2. Run the following curl command and observe the output:

curl -X 'POST' \
  "http://localhost:8000/v1/completions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "model": "mistralai/mistral-7b-instruct-v0.3",
      "prompt": "What is a Kubernetes Pod",
      "max_tokens": 30
      }'
Bash

You should see a response similar to the following:

{
    "id": "cmpl-b50fb31c13e4420bac5243047ef5e404",
    "object": "text_completion",
    "created": 1741976435,
    "model": "mistralai/mistral-7b-instruct-v0.3",
    "choices": [
        {
            "index": 0,
            "text": "?\n\nA Kubernetes Pod is the smallest unit of computation in the Kubernetes API object model that represents a portion of a running application. Each",
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 7,
        "total_tokens": 37,
        "completion_tokens": 30
    }
}
JSON

You have successfully deployed the model to EKS Hybrid Nodes. Next, you deploy the model to the EKS Auto Mode nodes running in the same EKS cluster.

Deploying to EKS Auto Mode

Workloads that don’t need to run on EKS Hybrid Nodes can run in-Region on EKS Auto Mode nodes instead. The EKS Auto Mode built-in NodePools don’t include GPU-based instances, so you must define a NodePool that allows GPU instance types. EKS Auto Mode provides out-of-the-box integration with NVIDIA GPUs and AWS Neuron devices, so you don’t need to install drivers or device plugins.

1. Create a NodePool with the g6 instance family by running the following command:

cat > nodepool-gpu.yaml <<EOF
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 1h
    consolidationPolicy: WhenEmpty
  template:
    metadata: {}
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "eks.amazonaws.com/instance-family"
          operator: In
          values:
          - g6
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      terminationGracePeriod: 24h0m0s
EOF

kubectl apply -f nodepool-gpu.yaml
Bash

If your workload has specific network bandwidth or GPU requirements, then consider also setting other well-known labels supported by EKS Auto Mode, as illustrated in the snippet that follows.
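
For example, you might add requirements along these lines under .spec.template.spec.requirements in nodepool-gpu.yaml. The label names shown are assumptions based on the EKS Auto Mode supported labels; verify them against the current documentation before use.

- key: "eks.amazonaws.com/instance-gpu-name"
  operator: In
  values: ["l4"]
- key: "eks.amazonaws.com/instance-gpu-count"
  operator: In
  values: ["1"]
YAML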

2. Update NVIDIA NIM values for deployment on EKS Auto Mode by creating the following file:

cat > nim.values.yaml <<EOF
image:
    repository: "nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3"
    tag: latest
model:
  ngcAPISecret: ngc-api
nodeSelector:
  eks.amazonaws.com/compute-type: auto
resources:
  limits:
    nvidia.com/gpu: 1
persistence:
  enabled: false
imagePullSecrets:
  - name: nvcrio-cred
tolerations:
  - key: nvidia.com/gpu
    effect: NoSchedule
    operator: Exists
EOF
Bash

3. Run the following command to upgrade the NIM Helm release with the new values:

helm upgrade nim nim-llm/ -f ./nim.values.yaml
Bash

4. List the NodeClaims to verify that EKS Auto Mode has launched a g6.xlarge instance in the Region to serve the NVIDIA NIM:

> kubectl get nodeclaims

NAME        TYPE        CAPACITY    ZONE         NODE                  READY  
gpu-wq9qr   g6.xlarge   on-demand   us-west-2b   i-33333333333333333   True 
Bash

To test, repeat the steps in the preceding Testing NIM with example prompts section.

Cleaning up

To avoid incurring ongoing costs, clean up the AWS resources created as part of this post by running the following commands:

helm delete nim
kubectl delete -f nodepool-gpu.yaml
kubectl delete -f nvidia-device-plugin.yml
eksctl delete cluster -f cluster-configuration.yaml
Bash

Clean up any other resources you created as part of the prerequisites if they are no longer needed.

Conclusion

This post provides an example of how Amazon EKS Hybrid Nodes can power AI workloads. Hybrid nodes unify your Kubernetes footprint onto Amazon EKS, eliminating the need to manage the Kubernetes control plane and reducing operational overhead.

To learn more and get started with EKS Hybrid Nodes, see the EKS Hybrid Nodes user guide and explore the re:Invent 2024 session (KUB205), which covers how EKS Hybrid Nodes works, its features, and best practices. For more guidance on running AI/ML workloads on Amazon EKS, check out the Data on EKS project.