
Scaling a Large Language Model with NVIDIA NIM on Amazon EKS with Karpenter

Many organizations are building artificial intelligence (AI) applications using Large Language Models (LLMs) to deliver new experiences to their customers, from content creation to customer service and data analysis. However, the substantial size and intensive computational requirements of these models can make them challenging to configure, deploy, and scale effectively on graphics processing units (GPUs). Moreover, these organizations also want to achieve low-latency, high-performance inference in a cost-efficient way.

To simplify LLM deployment, NVIDIA introduced NVIDIA Inference Microservices (NIM) containers. These containers are designed to streamline and accelerate the deployment of LLMs on Kubernetes, offering these benefits:

  • Streamlines AI model deployment and management for developers and IT teams.
  • Optimizes performance and resource usage on NVIDIA hardware.
  • Enables enterprises to maintain control and security of their AI deployments.

In this post, we show you how to deploy NVIDIA NIM on Amazon Elastic Kubernetes Service (Amazon EKS), demonstrating how you can manage and scale LLMs such as Meta’s Llama-3-8B on Amazon EKS. We cover everything from prerequisites and installation to load testing, performance analysis, and observability. Furthermore, we emphasize the benefits of using Amazon EKS combined with Karpenter for dynamic scaling and efficient management of these workloads.

Solution overview

This solution deploys the NVIDIA NIM container with the Llama-3-8B model across two g5.2xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances for high availability (HA). Each instance hosts one replica of the NIM container because each g5.2xlarge instance has a single GPU. These nodes are provisioned by Karpenter when the Llama-3-8B model is deployed through the NIM Helm chart. This setup makes sure that resources are used efficiently, scaling up or down based on demand.

The Horizontal Pod Autoscaler (HPA) can further scale these replicas based on throughput or other metrics, with data provided by Prometheus. Grafana is used for monitoring and visualizing these metrics. To access the LLM model endpoints, the solution uses a Kubernetes service, an NGINX ingress controller, and a Network Load Balancer (NLB). Users send their inference requests to the NLB endpoint, while the NIM pods pull container images from the NVIDIA NGC container registry and use Amazon Elastic File System (Amazon EFS) for shared storage across the nodes.
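
Once the solution is deployed, you can look up the DNS name of the NLB that fronts the NGINX ingress controller. The following is a minimal sketch that assumes the controller runs in the ingress-nginx namespace under the default service name; adjust both to match your deployment:

# Hypothetical lookup of the NLB DNS name behind the NGINX ingress controller
kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'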

Figure 1. Architecture of NVIDIA NIM LLM on Amazon EKS

Deploying Llama-3-8B model with NIM

To streamline the deployment and management of NVIDIA NIM with the Llama-3-8B model, you use Data on EKS Blueprints with Terraform. This infrastructure as code (IaC) approach provides a consistent, reproducible deployment process, which lays a strong foundation for scalable and maintainable model serving on Amazon EKS.

Prerequisites

Before we begin, make sure you have the following:

  • An AWS account with permissions to create Amazon EKS, Amazon EC2, Amazon EFS, and Elastic Load Balancing resources.
  • The AWS CLI, kubectl, Terraform, and git installed and configured.
  • An NVIDIA NGC account and an NGC API key, which are required to pull the NIM container images and models.

Setup

1. Configure the NGC API Key

Retrieve your NGC API key from NVIDIA and set it as an environment variable:

export TF_VAR_ngc_api_key=<replace-with-your-NGC-API-KEY>

2. Installation

Before deploying the blueprint, update the region variable in the variables.tf file to your desired deployment AWS Region. Also confirm that your local AWS Region setting matches the specified Region to prevent discrepancies, for example by setting export AWS_DEFAULT_REGION="<REGION>".
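
For example, assuming you choose us-west-2 as the deployment Region (substitute your own), you can set the environment variable and confirm how the AWS CLI resolves it:

export AWS_DEFAULT_REGION="us-west-2"
# Show the resolved Region and where it comes from (env, config file, and so on)
aws configure list | grep region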

Then, clone the repository and run the installation script:

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/ai-ml/nvidia-triton-server
export TF_VAR_enable_nvidia_nim=true
export TF_VAR_enable_nvidia_triton_server=false
./install.sh

This installation process takes approximately 20 minutes to complete. If the installation doesn't complete successfully for any reason, you can re-run the install.sh script to reapply the Terraform templates.
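
If you need the kubeconfig helper command again later, you can print it from the Terraform outputs. This is a minimal sketch that assumes install.sh has already initialized and applied Terraform in this directory:

# Print the configure_kubectl command exposed as a Terraform output
terraform output configure_kubectl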

3. Verify the installation

When the installation finishes, you can find the configure_kubectl command in the Terraform output. Enter the following command to create or update the kubeconfig file for your cluster. Replace region-code with the Region that your cluster is in.

aws eks --region <region-code> update-kubeconfig --name nvidia-triton-server

Enter the following command to check that the nim-llm pod status is Running:

kubectl get all -n nim

You should observe output similar to the following:

NAME                               READY   STATUS    RESTARTS   AGE
pod/nim-llm-llama3-8b-instruct-0   1/1     Running   0          4h2m

NAME                                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/nim-llm-llama3-8b-instruct       ClusterIP   172.20.5.230   <none>        8000/TCP   4h2m
service/nim-llm-llama3-8b-instruct-sts   ClusterIP   None           <none>        8000/TCP   4h2m

NAME                                          READY   AGE
statefulset.apps/nim-llm-llama3-8b-instruct   1/1     4h2m

NAME                                                             REFERENCE                                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/nim-llm-llama3-8b-instruct   StatefulSet/nim-llm-llama3-8b-instruct   2/5       1         5         1          4h2m

The llama3-8b-instruct model is deployed as a StatefulSet in the nim namespace. Because the pod requests a GPU, Karpenter provisioned a GPU instance to run it. To check the EC2 instances that Karpenter provisioned, enter the following command:

kubectl get node -l type=karpenter -L node.kubernetes.io/instance-type

You should observe output similar to the following:

NAME                                         STATUS   ROLES    AGE     VERSION               INSTANCE-TYPE
ip-100-64-77-39.us-west-2.compute.internal   Ready    <none>   4m46s   v1.30.0-eks-036c24b   g5.2xlarge
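
Optionally, you can confirm that the GPU on the new node is advertised to Kubernetes. The following sketch assumes the NVIDIA device plugin deployed by the blueprint is already running on the node:

# Show the GPU capacity and allocatable count reported by the Karpenter-provisioned nodes
kubectl describe nodes -l type=karpenter | grep -i 'nvidia.com/gpu'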

Testing NIM with example prompts

For demonstration purposes, we’re using port-forwarding with the Kubernetes service instead of exposing the load balancer endpoint. This approach allows you to access the service locally without making the NLB publicly accessible.

kubectl port-forward -n nim service/nim-llm-llama3-8b-instruct 8000

Then, open another terminal and invoke the deployed model with an HTTP request using the curl command:

curl -X 'POST' \
    "http://localhost:8000/v1/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
    }'

You should observe output similar to the following.

{
  "id": "cmpl-xxxxxxxxxxxxxxxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1719742336,
  "model": "meta/llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": ", there was a young man named Jack who lived in a small village at the foot of a vast and ancient forest. Jack was a curious and adventurous soul, always eager to explore the world beyond his village. One day, he decided to venture into the forest, hoping to discover its secrets.\nAs he wandered deeper into",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 69,
    "completion_tokens": 64
  }
}

This confirms that the deployed Llama3 model is up and running and can serve requests.
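
Because NIM exposes an OpenAI-compatible API, you can also exercise the chat completions endpoint. The following is a minimal sketch that assumes the same port-forward is still active; the prompt is only an example:

curl -X 'POST' \
    "http://localhost:8000/v1/chat/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Give me a one-sentence summary of Kubernetes."}],
    "max_tokens": 64
    }'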

Autoscaling with Karpenter

Now that you’ve verified the deployed model is functioning correctly, it’s time to test its scaling capability. First, we set up an environment for the testing:

cd gen-ai/inference/nvidia-nim/nim-client

python3 -m venv .venv
source .venv/bin/activate
pip install openai

We have prepared a file named prompts.txt containing 20 prompts. Use the following command to run these prompts and verify the generated outputs:

python3 client.py --input-prompts prompts.txt --results-file results.txt

You should observe an output similar to the following:

Loading inputs from `prompts.txt`...
Model meta/llama3-8b-instruct - Request 14: 4.68s (4678.46ms)
Model meta/llama3-8b-instruct - Request 10: 6.43s (6434.32ms)
Model meta/llama3-8b-instruct - Request 3: 7.82s (7824.33ms)
Model meta/llama3-8b-instruct - Request 1: 8.54s (8540.69ms)
Model meta/llama3-8b-instruct - Request 5: 8.81s (8807.52ms)
Model meta/llama3-8b-instruct - Request 12: 8.95s (8945.85ms)
Model meta/llama3-8b-instruct - Request 18: 9.77s (9774.75ms)
Model meta/llama3-8b-instruct - Request 16: 9.99s (9994.51ms)
Model meta/llama3-8b-instruct - Request 6: 10.26s (10263.60ms)
Model meta/llama3-8b-instruct - Request 0: 10.27s (10274.35ms)
Model meta/llama3-8b-instruct - Request 4: 10.65s (10654.39ms)
Model meta/llama3-8b-instruct - Request 17: 10.75s (10746.08ms)
Model meta/llama3-8b-instruct - Request 11: 10.86s (10859.91ms)
Model meta/llama3-8b-instruct - Request 15: 10.86s (10857.15ms)
Model meta/llama3-8b-instruct - Request 8: 11.07s (11068.78ms)
Model meta/llama3-8b-instruct - Request 2: 12.11s (12105.07ms)
Model meta/llama3-8b-instruct - Request 19: 12.64s (12636.42ms)
Model meta/llama3-8b-instruct - Request 9: 13.37s (13370.75ms)
Model meta/llama3-8b-instruct - Request 13: 13.57s (13571.28ms)
Model meta/llama3-8b-instruct - Request 7: 14.90s (14901.51ms)
Storing results into `results.txt`...
Accumulated time for all requests: 206.31 seconds (206309.73 milliseconds)
PASS: NVIDIA NIM example
Actual execution time used with concurrency 20 is: 14.92 seconds (14.92 milliseconds)

You can check the generated responses in results.txt, which contains output similar to the following:

The key differences between traditional machine learning models and very large language models (vLLM) are:

1. **Scale**: vLLMs are massive, with billions of parameters, whereas traditional models typically have millions.
2. **Training data**: vLLMs are trained on vast amounts of text data, often sourced from the internet, whereas traditional models are trained on smaller, curated datasets.
3. **Architecture**: vLLMs often use transformer architectures, which are designed for sequential data like text, whereas traditional models may use feedforward networks or recurrent neural networks.
4. **Training objectives**: vLLMs are often trained using masked language modeling or next sentence prediction tasks, whereas traditional models may use classification, regression, or clustering objectives.
5. **Evaluation metrics**: vLLMs are typically evaluated using metrics like perplexity, accuracy, or fluency, whereas traditional models may use metrics like accuracy, precision, or recall.
6. **Interpretability**: vLLMs are often less interpretable due to their massive size and complex architecture, whereas traditional models may be more interpretable due to their smaller size and simpler architecture.

These differences enable vLLMs to excel in tasks like language translation, text generation, and conversational AI, whereas traditional models are better suited for tasks like image classification or regression.

=========

TensorRT (Triton Runtime) optimizes LLM (Large Language Model) inference on NVIDIA hardware by:

1. **Model Pruning**: Removing unnecessary weights and connections to reduce model size and computational requirements.
2. **Quantization**: Converting floating-point models to lower-precision integer formats (e.g., INT8) to reduce memory bandwidth and improve performance.
3. **Kernel Fusion**: Combining multiple kernel launches into a single launch to reduce overhead and improve parallelism.
4. **Optimized Tensor Cores**: Utilizing NVIDIA's Tensor Cores for matrix multiplication, which provides significant performance boosts.
5. **Batching**: Processing multiple input batches concurrently to improve throughput.
6. **Mixed Precision**: Using a combination of floating-point and integer precision to balance accuracy and performance.
7. **Graph Optimization**: Reordering and reorganizing the computation graph to minimize memory access and optimize data transfer.

By applying these optimizations, TensorRT can significantly accelerate LLM inference on NVIDIA hardware, achieving faster inference times and improved performance.

=========

You might still observe only one pod, because the current pod can handle the incoming load. To further increase the load, you can add the --iterations flag to the script with the number of iterations you want to run. For example, to run five iterations, run the following:

python3 client.py \
  --input-prompts prompts.txt \
  --results-file results.txt \
  --iterations 5

You can also repeat this multiple times. While the load is running, use the following command to observe the new pods; note that a newly started pod takes some time to become ready.

kubectl get po,hpa -n nim

After a while, you should observe output similar to the following:

NAME                               READY   STATUS    RESTARTS   AGE
pod/nim-llm-llama3-8b-instruct-0   1/1     Running   0          35m
pod/nim-llm-llama3-8b-instruct-1   1/1     Running   0          7m39s
pod/nim-llm-llama3-8b-instruct-2   1/1     Running   0          7m39s
pod/nim-llm-llama3-8b-instruct-3   1/1     Running   0          7m39s

NAME                                                             REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/nim-llm-llama3-8b-instruct   StatefulSet/nim-llm   18/5      1         5         4          9d

There is an HPA resource named nim-llm-llama3-8b-instruct, which is deployed as part of the nim-llm Helm chart. The autoscaling is driven by the num_requests_running metric, which is exposed by NIM. We have preconfigured the Prometheus Adapter so that the HPA can use this custom metric, which facilitates the autoscaling of NIM pods based on real-time demand.

$ kubectl describe hpa nim-llm-llama3-8b-instruct -n nim

…
Reference:                         StatefulSet/nim-llm-llama3-8b-instruct
Metrics:                           ( current / target )
  "num_requests_running" on pods:  1 / 5
Min replicas:                      1
Max replicas:                      5
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Stabilization Window: 300 seconds
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
StatefulSet pods:      4 current / 4 desired
…
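
To confirm that the Prometheus Adapter is serving this custom metric to the HPA, you can query the custom metrics API directly. This sketch assumes the adapter registers the metric under the name num_requests_running and that jq is installed (drop the pipe otherwise):

kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim/pods/*/num_requests_running" | jq .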

At the instance level, Karpenter automatically launches instances when pods are unschedulable and match the NodePool definition. GPU (g5) instances are launched for the NIM pods because we have configured the NodePool as follows:

nodePool:
  labels:
    - type: karpenter
    - NodeGroupType: g5-gpu-karpenter
  taints:
    - key: nvidia.com/gpu
      value: "Exists"
      effect: "NoSchedule"
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["g5"]
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: [ "2xlarge", "4xlarge", "8xlarge", "16xlarge", "12xlarge", "24xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]

Karpenter provides the flexibility to define a range of instance specifications rather than being limited to fixed instance types. When both Spot and On-Demand Instances are configured as options, Karpenter prioritizes Spot Instances using the price-capacity-optimized allocation strategy. This strategy requests Spot Instances from the pools that have the lowest chance of interruption in the near term, and then from the lowest priced of those pools.
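
You can verify which capacity type and instance size Karpenter actually chose by inspecting the labels that it applies to the nodes it launches:

kubectl get node -l type=karpenter \
  -L karpenter.sh/capacity-type -L node.kubernetes.io/instance-type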

Observability

To monitor the deployment, we’ve implemented the Prometheus stack, which includes both the Prometheus server and Grafana for monitoring capabilities.

Start by verifying the services deployed by the Kube Prometheus stack with the following command:

kubectl get svc -n kube-prometheus-stack

You should observe output similar to the following:

NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
kube-prometheus-stack-grafana                    ClusterIP   172.20.225.77    <none>        80/TCP              10m
kube-prometheus-stack-kube-state-metrics         ClusterIP   172.20.237.248   <none>        8080/TCP            10m
kube-prometheus-stack-operator                   ClusterIP   172.20.118.163   <none>        443/TCP             10m
kube-prometheus-stack-prometheus                 ClusterIP   172.20.132.214   <none>        9090/TCP,8080/TCP   10m
kube-prometheus-stack-prometheus-node-exporter   ClusterIP   172.20.213.178   <none>        9100/TCP            10m
prometheus-adapter                               ClusterIP   172.20.171.163   <none>        443/TCP             10m
prometheus-operated                              ClusterIP   None             <none>        9090/TCP            10m

The NVIDIA NIM LLM service exposes metrics through the /metrics endpoint from the nim-llm service at port 8000. Verify it by running the following:

kubectl get svc -n nim
kubectl port-forward -n nim svc/nim-llm-llama3-8b-instruct 8000

Open another terminal, and enter the following:

curl localhost:8000/metrics

You should observe numerous metrics (such as num_requests_running and time_to_first_token_seconds) in Prometheus format exposed by the NIM service.
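
You can also confirm that Prometheus is scraping these metrics by querying its HTTP API. The following sketch uses the kube-prometheus-stack service names shown earlier and assumes jq is installed:

kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-prometheus 9090
# In another terminal:
curl -s 'http://localhost:9090/api/v1/query?query=num_requests_running' | jq .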

Grafana dashboard

We’ve set up a pre-configured Grafana dashboard that displays several key metrics:

  • Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
  • Inter-Token Latency (ITL): The latency between each token after the first.
  • Total Throughput: The total number of tokens generated per second by the NIM.

You can find more metrics descriptions in this NVIDIA document.
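
As an illustration, with the same Prometheus port-forward in place you can approximate the p95 TTFT with a PromQL query. This sketch assumes time_to_first_token_seconds is exposed as a histogram, as the metrics endpoint output above suggests:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(time_to_first_token_seconds_bucket[5m])) by (le))' | jq .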

To view the Grafana dashboard, refer to our guide on the Data on EKS website.

Figure 2. Grafana dashboard example provided by NVIDIA

Performance testing with the NVIDIA GenAI-Perf tool

GenAI-Perf is a command-line tool designed to measure the throughput and latency of generative AI models as they are served through an inference server. It serves as a standard benchmarking tool that allows you to compare the performance of different models deployed with inference servers.

To streamline the testing process, particularly because the tool requires a GPU, we've provided a pre-configured manifest file, genaiperf-deploy.yaml, that allows you to deploy and run GenAI-Perf in your environment. This setup enables you to quickly assess the performance of your AI models and confirm that they meet your latency and throughput requirements.

cd gen-ai/inference/nvidia-nim
kubectl apply -f genaiperf-deploy.yaml

When the pod is running and shows a ready status of 1/1, open a shell in the pod with the following commands:

export POD_NAME=$(kubectl get po -l app=tritonserver -ojsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD_NAME -- bash

Then enter the following command to test the deployed NIM Llama3 model:

genai-perf \
  -m meta/llama3-8b-instruct \
  --service-kind openai \
  --endpoint v1/completions \
  --endpoint-type completions \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 10 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url nim-llm-llama3-8b-instruct.nim:8000

You should observe similar output to the following:

2024-07-18 07:11 [INFO] genai_perf.parser:166 - Model name 'meta/llama3-8b-instruct' cannot be used to create artifact directory. Instead, 'meta_llama3-8b-instruct' will be used.
2024-07-18 07:12 [INFO] genai_perf.wrapper:137 - Running Perf Analyzer : 'perf_analyzer -m meta/llama3-8b-instruct --async --input-data artifacts/meta_llama3-8b-instruct-openai-completions-concurrency10/llm_inputs.json --endpoint v1/completions --service-kind openai -u nim-llm.nim:8000 --measurement-interval 4000 --stability-percentage 999 --profile-export-file artifacts/meta_llama3-8b-instruct-openai-completions-concurrency10/my_profile_export.json -i http --concurrency-range 10'
                                                      LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃            Statistic ┃           avg ┃           min ┃           max ┃           p99 ┃           p90 ┃           p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Request latency (ns) │ 3,946,813,567 │ 3,917,276,037 │ 3,955,037,532 │ 3,955,012,078 │ 3,954,685,886 │ 3,954,119,635 │
│     Num output token │           112 │           105 │           119 │           119 │           117 │           115 │
│      Num input token │           200 │           200 │           200 │           200 │           200 │           200 │
└──────────────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┘
Output token throughput (per sec): 284.85
Request throughput (per sec): 2.53

You can review the metrics collected by GenAI-Perf, such as request latency, output token throughput, and request throughput. For detailed information on the command line options available with GenAI-Perf, refer to the official documentation.
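
If you want to analyze the raw results locally, you can copy the profile export out of the pod. This is a sketch only; the artifact path is taken from the GenAI-Perf output above, is relative to the directory where you ran genai-perf inside the pod, and may differ in your run:

export POD_NAME=$(kubectl get po -l app=tritonserver -ojsonpath='{.items[0].metadata.name}')
# Copy the GenAI-Perf profile export from the pod to your local machine
kubectl cp \
  "$POD_NAME:artifacts/meta_llama3-8b-instruct-openai-completions-concurrency10/my_profile_export.json" \
  ./my_profile_export.json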

Conclusion

This post outlined the deployment of NVIDIA NIM with the Llama-3-8B model on Amazon EKS, using Karpenter and AWS services such as Amazon EFS and Elastic Load Balancing (ELB) to create a scalable and cost-effective infrastructure. Karpenter's dynamic scaling of worker nodes provided efficient resource allocation based on demand. We also benchmarked performance metrics using NVIDIA's GenAI-Perf tool, showcasing the system's capabilities.

To streamline the deployment process, Data on EKS provides ready-to-deploy IaC templates, which allow organizations to set up their infrastructure in a few hours. These templates can be customized to fit your specific IaC needs.

To get started scaling and running your data and machine learning (ML) workloads on Amazon EKS, check out the Data on EKS project on GitHub.

Thank you to all the reviewers who provided valuable feedback and guidance in shaping this blog. Your efforts were instrumental in getting it published. Special thanks to Robert Northard, Bonnie Ng and Apoorva Kulkarni for their time and expertise in the review process. Your contributions are greatly appreciated!

Shawn Zhang

Shawn is a Specialist Solutions Architect at AWS Hong Kong, with expertise in multiple domains, including containers and serverless computing. He is passionate about application architecture and Generative AI solutions, serving as a local advocate for these technologies in Hong Kong. Prior to joining AWS, Shawn gained valuable experience in DevOps and cloud platform engineering with several prominent companies in the region.

Praseeda Sathaye

Praseeda Sathaye is a Principal Specialist for App Modernization and Containers at Amazon Web Services, based in the Bay Area in California. She has been focused on helping customers accelerate their cloud-native adoption journey by modernizing their platform infrastructure and internal architecture using microservices strategies, containerization, platform engineering, GitOps, Kubernetes and service mesh. At AWS she is working on AWS services like EKS, ECS and helping strategic customers to run at scale.

Vara Bonthu

Vara Bonthu is a dedicated technology professional and Worldwide Tech Leader for Data on EKS, specializing in assisting AWS customers ranging from strategic accounts to diverse organizations. He is passionate about open-source technologies, Data Analytics, AI/ML, and Kubernetes, and boasts an extensive background in development, DevOps, and architecture. Vara's primary focus is on building highly scalable Data and AI/ML solutions on Kubernetes platforms, helping customers harness the full potential of cutting-edge technology for their data-driven pursuits.