AWS HPC Blog

Deploying Generative AI Applications with NVIDIA NIM Microservices on Amazon Elastic Kubernetes Service (Amazon EKS) – Part 2

This post was contributed by Abhishek Sawarkar (NVIDIA), Alex Iankoulski (AWS), Aman Shanbhag (AWS), Deepika Padmanabhan (NVIDIA), Eliuth Triana Isaza (NVIDIA), Jiahong Liu (NVIDIA), Joey Chou (AWS)

Today we continue with part 2 of our step-by-step guide to creating a cluster of Amazon Elastic Compute Cloud (Amazon EC2) G5 instances (g5.48xlarge), each accelerated by 8 NVIDIA A10G Tensor Core GPUs, and using Amazon Elastic Kubernetes Service (Amazon EKS) to host your inference solution with NVIDIA NIM inference microservices for deploying AI models at scale. You can check out part 1 here, where we walk you through getting set up with NVIDIA NIM quickly.

Note: As of writing this blog, AWS has announced General Availability of Amazon EC2 G6e instances, powered by NVIDIA L40S Tensor Core GPUs. To learn more, check out Amazon EC2 G6e instances. You may use the cluster deployment template in awsome-inference to deploy an Amazon EKS cluster of G6e instances and follow along with the rest of the blog. Additionally, AWS has announced Amazon EC2 P5e instances, powered by NVIDIA H200 GPUs, available through Capacity Blocks for ML. To use P5e instances, follow the instructions in the awsome-inference/1.infrastructure README to deploy an EKS cluster with Capacity Blocks.

Amazon EKS is a managed service for running Kubernetes workloads on AWS. We can use Amazon EKS to orchestrate NVIDIA NIM (plural: NIM microservices) pods across multiple nodes because it automatically manages the availability and scalability of the Kubernetes control plane nodes, which are responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks. Amazon EKS also integrates with AWS networking, security, and infrastructure services, so your workloads get the performance, scale, reliability, and availability of AWS. You can find all the details in the Amazon EKS documentation.

Similar to part 1, you can find the code in our awsome-inference GitHub repository. This repo contains reference architectures and test cases for inference on AWS. The examples cover a variety of models, use-cases, and frameworks for inference optimization.

In this blog, we show you how to deploy NVIDIA NIM with a customized configuration, and how to set up Prometheus to scrape custom metrics emitted by your NIM pods. We then show you how to scale your inference workloads using these scraped metrics. Specifically, we cover two scaling options: first, the Cluster Auto Scaler (CAS) for scaling your instances combined with the Horizontal Pod Auto Scaler (HPA) for scaling your NIM pods; second, Karpenter for scaling your instances combined with Kubernetes Event-driven Autoscaling (KEDA) for scaling your NIM pods. Lastly, we show you how to load balance across your NIM pods using an Application Load Balancer (ALB).

Recap on NVIDIA NIM

NVIDIA NIM (referred to as NIM for the rest of this post), part of NVIDIA AI Enterprise, is available on the AWS Marketplace. NIM is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inference, supporting the next generation of world-class generative AI applications. Its main benefits include ease of use, performance at scale, and security; to learn more about these benefits, check out part 1 of this blog. For more information on NIM, check out the NIM Documentation or Getting Started pages, and to learn more about hardware support for NIM workloads, check out the Support Matrix on NVIDIA Docs. To see performance results for different NIM microservices (LLMs including Llama3, Llama3.1, and Mixtral) on EKS, check out NIM Test Inference Results in the awsome-inference GitHub repository.

Recap on the Architecture Diagram for this blog series & what you’ve provisioned so far…

Figure 1 – This is the architecture diagram of what you will be provisioning during this 2-part-blog series. In part 1, we provisioned EKS Resources (the Control Plane VPC in AWS’ account, that is fully managed), including the pods and worker node groups (g5.48xlarge in this series). Additionally, we also provisioned the Data Plane VPC and associated resources – Subnets (Public & Private), NAT Gateways, and the Internet Gateway. In this blog, we will cover automatic scaling (for both the cluster and pods) and load balancing (ingress of type Application Load Balancer). The Elastic File System deployment is optional, and will be necessary for larger models, like the Llama3.1 405B NIM.

Deploying a customized NIM

The values.yaml file contains all of the NIM configuration values needed to deploy the Helm chart. In the first part of this blog series, to deploy the Llama3-8B-Instruct NIM quickly, we simply ran the command helm install my-nim nim-llm/ --set model.ngcAPIKey=$NGC_CLI_API_KEY --set persistence.enabled=true, which used the values.yaml file provided by NVIDIA (with some pre-baked values). To deploy a custom values.yaml file, you have two options.

First, you could follow the same method as in part 1, and simply edit the relevant parameters in the same values.yaml file. For example, if you’d like to deploy the Llama3.1-8B-Instruct model instead, you’d simply replace the repository:

image:
  repository: nvcr.io/nim/meta/llama3-8b-instruct
  pullPolicy: IfNotPresent
  tag: "latest"

with:

image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
  pullPolicy: IfNotPresent
  tag: "latest"

Alternatively, you can write your own custom values.yaml file to define your own values for your NIM deployment. To do so, you'd need to follow a few additional steps. For more information on all the NIM container images available for you to deploy today, check out the NGC Catalog.
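
For illustration, a minimal custom values file might look like the sketch below. The keys follow the nim-llm Helm chart used in part 1, but the storage class name, volume size, and GPU count are assumptions you should adjust; the chart's own values.yaml remains the authoritative reference.

# custom-values.yaml -- a minimal sketch; verify each key against the nim-llm chart's values.yaml
image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
  tag: "latest"
imagePullSecrets:
  - name: registry-secret        # Docker registry secret created in the next step
model:
  ngcAPISecret: ngc-api          # generic secret holding NGC_CLI_API_KEY, created in the next step
persistence:
  enabled: true
  storageClass: ebs-sc           # assumes the EBS CSI storage class provisioned by the setup script below
  size: 50Gi
resources:
  limits:
    nvidia.com/gpu: 1            # GPUs per NIM pod; increase for larger models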

First, you'll need to set up your Kubernetes secrets. There are two secrets to create: a generic secret holding your NGC API key, and a Docker registry secret that lets you pull images from NGC. This blog assumes that you have used the instructions in part 1 to set up your NGC Personal API Key, and that you have an EKS cluster and the other infrastructure defined by awsome-inference up and running.

export NGC_CLI_API_KEY="key from ngc"
kubectl create secret generic ngc-api --from-literal=NGC_CLI_API_KEY=$NGC_CLI_API_KEY
kubectl create secret docker-registry registry-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_CLI_API_KEY

In part 1, under “Troubleshooting”, we mentioned that we often saw errors relating to Kubernetes Persistent Volume Claims (PVCs) backed by Amazon Elastic Block Store (Amazon EBS), since the default Helm charts didn't provision the Amazon EBS Container Storage Interface (EBS CSI) controller. To avoid that blocker, we now provide a shell script that uses the storage.yaml file to provision the EBS CSI controller. It also gives you the option to provision an EFS CSI controller.

bash setup/setup.sh

Similar to part 1, you can now use Helm to deploy your own custom values.yaml file. For the sake of demonstration, we've used custom-values-ebs-sc.yaml, found in NVIDIA's NIM GitHub repository, nim-deploy. A fork of this file can also be found in storage/ in awsome-inference. You can follow the README to set up your own custom values file.
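
With the secrets and storage class in place, the deployment command is the same as in part 1, except that you pass your custom values file with -f. The release name below is just an example; if your values file references the ngc-api secret created earlier, you don't need to pass the API key with --set.

helm install my-nim nim-llm/ -f custom-values-ebs-sc.yaml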

Monitoring & Observability

As part of the /metrics endpoint (available at http://localhost:8000/metrics from within the NIM pod), NIM pods emit a number of metrics describing request statistics. We will use Prometheus to scrape these custom metrics and use them to scale our inference workload. Specifically, we will scrape the num_requests_running metric, which is defined as the number of requests currently running on the GPU(s).

To do so, you'll first need to install the Prometheus stack using Helm, and then install the Prometheus Adapter. The adapter exposes the custom metrics that Prometheus scrapes from your NIM pods through the Kubernetes custom metrics API, so that autoscalers can consume them. You can follow the instructions in this README to install the Prometheus stack and adapter.
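
If you'd like a sense of what those steps involve, installing the stack and the adapter typically comes down to two community Helm charts, as sketched below; the release names are illustrative, and the README remains the source of truth for the exact commands and values used in this repo.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Prometheus, Alertmanager, Grafana, and the Prometheus Operator
helm install prometheus prometheus-community/kube-prometheus-stack
# Adapter that exposes Prometheus metrics through the Kubernetes custom metrics API
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus-operated.default.svc \
  --set prometheus.port=9090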

Once the required components are installed, you need to define a YAML manifest to tell the Prometheus Adapter which metric to look for. We've provided one called monitoring/custom-rules.yaml that exposes the num_requests_running metric. Feel free to use this file as a template for other metrics.
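
For reference, a Prometheus Adapter rule for this metric usually has the shape below; compare it with monitoring/custom-rules.yaml before applying anything, since the repo's file is the source of truth.

rules:
  custom:
    # Discover the num_requests_running series emitted by the NIM pods
    - seriesQuery: 'num_requests_running{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      # Expose it under the same name through the custom metrics API
      name:
        matches: "num_requests_running"
        as: "num_requests_running"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'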

You can run kubectl get pods -A and check that all your Prometheus pods are Running and Ready. If you’d like to confirm whether the num_requests_running metric is being scraped by Prometheus (and emitted by your NIM pods), you can access the Prometheus UI. To test this, first send an inference request to your NIM pod by using the instructions in the README.

Once you get an inference response, access the Prometheus UI using

kubectl port-forward svc/prometheus-operated 9090

This command will forward the Prometheus Service Port (9090) to your local machine on port 9090. You can then access the Prometheus UI by opening up http://localhost:9090 on your web browser.

In the Prometheus UI, navigate to “Status” → “Targets”. This page lists all registered targets (endpoints) that Prometheus is scraping metrics from. Make sure that your NIM pods appear here.

Once you confirm that your pods are being scraped, query the metric: in the “Graph” tab, enter num_requests_running as a query. This shows time-series data for the num_requests_running metric across all the NIM pods you have deployed.

Figure 2 – Example of a graph on the Prometheus UI, scraping the num_requests_running metric. Here, we check whether the num_requests_running metric is being scraped. This example graph shows us concurrent requests sent to our NIM pods over a period of time.

Note: If you have only sent a single inference request so far, you should see a flat line at num_requests_running = 1. If you'd like to send concurrent requests, you can use the benchmarking tool genai-perf; we covered sending requests with genai-perf in part 1, under “Benchmarking”. If you do that, you should see a graph like the one in Figure 2.

If you'd like to visualize your scraped metrics, along with the number of nodes, in a dashboard (without having to port-forward your prometheus-operated service each time), you can use Grafana. Use the instructions in the awsome-inference README to set up the Grafana dashboard.

Scaling your NIM inference workload

In this section, we provide two options for configuring the scaling of your NIM workloads on Amazon EKS. Both are valid configurations, each with its own use cases.

Option 1: Using Horizontal Pod Auto Scaler (HPA) + Cluster Auto Scaler (CAS)

Why would you choose this configuration?

Cluster Auto Scaler (CAS) is responsible for automatically adjusting the size of the EKS cluster by adding or removing nodes based on the resource demands of the running workloads. It monitors the cluster’s overall utilization and scales the cluster accordingly, ensuring that there are enough resources available for the scheduled pods. Horizontal Pod Auto Scaler (HPA), on the other hand, focuses on scaling the number of replicas for individual deployments or replica sets based on predefined metrics, such as CPU or memory utilization. It continuously monitors the resource usage of the pods and adjusts the number of replicas to meet the application’s demand.

This combination of CAS and HPA provides a comprehensive scaling solution for EKS users. CAS ensures that the cluster has the necessary compute resources to handle the overall workload, while HPA fine-tunes the number of replicas for each application to optimize resource utilization and performance.

To use the Kubernetes Horizontal Pod Auto Scaler (HPA) to scale our NIM pods, we first need to install the Metrics Server.
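
The Metrics Server can be installed with a single manifest from the upstream kubernetes-sigs project:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml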

We can then create the HPA resource using either flags passed into the command line, or via a YAML manifest. In this example, since we are using custom metrics to scale, we recommend using the provided scaling/hpa-num-requests.yaml file. This HPA scales based on the aforementioned custom metric num_requests_running as scraped by Prometheus. You can change the following lines to set the thresholds for scaling, or use the file as is:

      target:
        # 100000m = an average of 100 concurrent requests per pod
        type: AverageValue
        averageValue: 100000m
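
For context, the full manifest wraps that target in a standard autoscaling/v2 HorizontalPodAutoscaler. The sketch below assumes a Deployment named llama3 (the name used later in this post when scaling manually) and example replica bounds; scaling/hpa-num-requests.yaml remains the source of truth.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama3                        # assumed NIM deployment name; verify with kubectl get deploy
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: num_requests_running    # custom metric exposed by the Prometheus Adapter
        target:
          type: AverageValue
          averageValue: 100000m         # an average of 100 concurrent requests per pod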

Lastly, to create the HPA resource, run

kubectl apply -f scaling/hpa-num-requests.yaml

With this, your HPA is deployed! You can now move on to deploying the Cluster Auto Scaler.

To scale your nodes, you'll also need to deploy the Cluster Auto Scaler (CAS) onto your cluster. You can follow the instructions in this repo to set up the CAS in your cluster.
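
At a high level, that setup applies the upstream autodiscovery manifest and points it at your cluster; the IAM permissions and remaining steps are covered in the linked instructions, and the cluster name below is a placeholder.

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Then edit the cluster-autoscaler deployment so the autodiscovery flag matches your cluster, for example:
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR_CLUSTER_NAME>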

Now, every time you scale up your pods and a pod is marked as unschedulable, the CAS deployment will provision a new node, as long as the target node type is available. To test it, you can manually scale up the pods using a command like

kubectl scale --replicas=<num-new-pods> deploy llama3

Option 2: Using Kubernetes Event-driven Autoscaling (KEDA) + Karpenter

Why would you choose this configuration?

Karpenter is an open-source node provisioner for Kubernetes that provides a flexible, event-driven approach to scaling Kubernetes clusters. Karpenter monitors the EKS cluster and automatically provisions or terminates nodes based on the actual resource demands of the running workloads. It leverages Amazon EC2 Fleet and supports various scheduling strategies, such as bin-packing or over-provisioning, to optimize resource utilization and reduce costs. KEDA, on the other hand, is an event-driven autoscaler that monitors various event sources, such as custom metrics, and automatically scales the number of replicas for the associated workloads. The combination of Karpenter and KEDA provides a powerful scaling solution for EKS users, particularly for event-driven architectures and workloads with unpredictable resource demands.

To get started with KEDA, you'll need to deploy it onto your EKS cluster. You can use the following steps to do so with Helm.
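
The KEDA installation itself is a couple of Helm commands; the keda namespace below is the project's conventional choice.

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace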

To configure auto scaling at the pod level (similar to HPA in option 1), we use KEDA to define a custom resource called ScaledObject, as shown in the file provided in scaling/kedas-config.yaml:

The manifest defines a ScaledObject resource named keda-prometheus-hpa. This ScaledObject is responsible for scaling our NIM pod deployment and has a minReplicaCount of 1, meaning one pod is always running. It scales the pods based on the custom metric num_requests_running, emitted by the NIM pods and scraped by Prometheus. We use the query max(num_requests_running) by (pod), so that if any single pod has more than 100 in-flight requests, we scale up. KEDA also scales the replicas back down after num_requests_running has been below 100 for more than 5 minutes.
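
Expressed as YAML, that ScaledObject looks roughly like the sketch below. The target deployment name and the Prometheus address are assumptions that depend on your Helm release names, so treat scaling/kedas-config.yaml as the source of truth.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-prometheus-hpa
spec:
  scaleTargetRef:
    name: llama3                          # assumed NIM deployment name; verify with kubectl get deploy
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 300                     # scale back down after 5 minutes below the threshold
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.default.svc:9090   # assumed Prometheus service address
        metricName: num_requests_running
        query: max(num_requests_running) by (pod)
        threshold: "100"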

To set up node scaling (similar to CAS), you can use Karpenter. You can use the Karpenter Documentation to run the steps required to set up Karpenter on your EKS Cluster.

Once you have Karpenter set up on your EKS Cluster, you can create a manifest to create a resource of type NodePool (i.e., a Karpenter configuration). You can find the Karpenter configuration we use in scaling/karpenter-config.yaml.

This Karpenter NodePool definition restricts Karpenter to launching instances from On-Demand capacity pools, since this blog assumes you're using On-Demand (OD), On-Demand Capacity Reservations (ODCR), or Capacity Blocks (CB) for your NIM deployment. If you'd like to use Spot Instances, feel free to add “spot” to the capacity-type values array.

Instances launched must be from the g (GPU accelerated) class, with instance generation being greater than 4. So, in this case, g5 and g6 are acceptable, but g4 is not. Feel free to change these values depending on your specific cluster configuration.

The default NodePool definition also defines disruption policies, which means that any underutilized nodes will be removed (scaled down) so that pods can be consolidated onto fewer or smaller nodes. The expireAfter parameter specifies the maximum lifetime of any node before it is scaled down.

We also provision an EC2NodeClass with blockDeviceMappings, so Karpenter can provision instances with large enough EBS volumes. This template uses 500GB EBS gp3 volumes, but feel free to change this as required.
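
Pulling those requirements together, a minimal NodePool and EC2NodeClass sketch looks like the following. It is written against the Karpenter v1beta1 API (field names and locations shift between Karpenter versions), and the node role, subnet, and security group selectors are placeholders, so use scaling/karpenter-config.yaml as the actual reference.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: nim-gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]           # add "spot" here to allow Spot Instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]                   # GPU-accelerated instance families
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]                   # g5 and g6 are acceptable, g4 is not
      nodeClassRef:
        name: nim-gpu
  disruption:
    consolidationPolicy: WhenUnderutilized  # consolidate pods onto fewer or smaller nodes
    expireAfter: 720h                       # maximum node lifetime
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nim-gpu
spec:
  amiFamily: AL2
  role: "<KarpenterNodeRole>"             # placeholder for your Karpenter node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<cluster-name>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<cluster-name>"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3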

Load Balancing Across your NIM Pods

Once you scale your inference workload, you need to ensure that your application traffic is distributed across your NIM pods. To do this, we will be creating a Kubernetes Ingress Resource of type Application Load Balancer.

You will first need to install the AWS Load Balancer Controller. If you prefer using Kubernetes Manifests, you can follow the instructions found in “Install the AWS Load Balancer Controller add-on using Kubernetes Manifests”. Alternatively, if you like using Helm, you can follow the instructions in “Install the AWS Load Balancer Controller using Helm”.
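
For reference, the Helm route comes down to commands like the following once the controller's IAM policy and service account are in place (both are covered in the linked instructions); substitute your own cluster name.

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=<YOUR_CLUSTER_NAME> \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller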

You can find a Kubernetes manifest file called ingress.yaml in the ingress/ sub-directory. This file provisions an internet-facing Application Load Balancer (ALB), specifies the health check port and path, and sets up routing paths that let you control how traffic is routed. This file only uses the / path, which means all traffic to NIM pods is sent to the nim-llm-service provisioned as part of your NIM deployment. If you'd like to add more prefixes or paths, check out “Listeners for your Application Load Balancers”. Once you're happy with your ingress configuration, you can run the following to provision your Application Load Balancer.

kubectl apply -f ingress/ingress.yaml
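
For reference, the manifest you just applied follows roughly the shape below; the health check path, backend service name, and port are assumptions based on the default NIM Helm chart, so check ingress/ingress.yaml for the exact values.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nim-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /v1/health/ready   # NIM readiness endpoint
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nim-llm-service     # assumed NIM service name; verify with kubectl get svc
                port:
                  number: 8000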

You can run kubectl get ingress to check whether your ingress was provisioned, and run kubectl describe ingress <ingress-name> to get more details.

If your pods don't get registered as healthy targets behind your Application Load Balancer, it is most likely due to failed health checks. To mitigate this issue, add the security group attached to the ALB to the inbound (ingress) rules of the security group attached to your EC2 instances.

To test whether the ALB is able to serve traffic to your pods, you can run the following curl command:

curl -X 'POST' \
'http://<ALB-DNS>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a polite and respectful chatbot helping people plan a vacation.",
      "role": "system"
    },
    {
      "content": "What should I do for a 4 day vacation in Spain?",
      "role": "user"
    }
  ],
  "model": "meta/llama3.1-8b-instruct",
  "max_tokens": 64,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": "\n",
  "frequency_penalty": 0.0
}'

As expected, this should return something like

{"id":"cmpl-228f8ceabea1479caff6142c33478f3b","object":"chat.completion","created":1720470045,"model":"meta/llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Spain is a wonderful destination! With four days, you can definitely get a taste"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":42,"total_tokens":58,"completion_tokens":16}}

Clean Up

When you’re done using an Amazon EKS cluster, you should delete the resources associated with it so that you don’t incur any unnecessary costs.

You can delete a cluster with eksctl, the AWS Management Console, or the AWS CLI. Follow the instructions on “Delete a cluster” to delete your EKS cluster.

Conclusion

In this blog series, we've shown you how to leverage Amazon EKS to orchestrate the deployment of pods containing NVIDIA NIM microservices, enabling quick-to-set-up, optimized, large-scale large language model (LLM) inference on Amazon EC2 G5 instances. Additionally, we've demonstrated scaling (of both pods and the cluster) by monitoring custom metrics via Prometheus, as well as load balancing. Lastly, we've pointed you to test results for running inference workloads with NIM on EKS. You can check out a list of all NIM images available to deploy on build.nvidia.com or the NGC Catalog.

NVIDIA NIM microservices empower researchers and developers to run optimized inference in their applications, and Amazon EKS helps with orchestration, scaling, and load balancing. Together, they let you scale your inference workloads easily and efficiently to thousands of GPUs, accelerating time-to-market for cutting-edge AI applications.

To learn more about deploying NVIDIA NIM microservices and the associated infrastructure, refer to AWS Samples. To learn more about Amazon EKS and NIM microservices, check out the EKS User Guide and NVIDIA NIM.

Abhishek Sawarkar

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within Cloud platforms & enhancing user experience on accelerated computing.

Aman Shanbhag

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Alex Iankoulski

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source Do framework (https://bit.ly/do-framework) and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.

Joey Chou

Joey Chou is a Sr. GenAI Specialist at AWS. He has a mixed background among AI, SW, and HW, with experience across applied AI, prototyping, and production.

Deepika Padmanabhan

Deepika Padmanabhan is a Solutions Architect at NVIDIA. She enjoys building and deploying NVIDIA’s Software Solutions in the Cloud. Outside work, she enjoys solving puzzles and playing video games like Age of Empires.

Eliuth Triana Isaza

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Jiahong Liu

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.