Containers

How Vannevar Labs cut ML inference costs by 45% using Ray on Amazon EKS

This blog is authored by Colin Putney (ML Engineer at Vannevar Labs), Shivam Dubey (Specialist SA, Containers at AWS), Apoorva Kulkarni (Sr. Specialist SA, Containers at AWS), and Rama Ponnuswami (Principal Container Specialist at AWS).

Vannevar Labs, a defense tech startup, successfully cut machine learning (ML) inference costs by 45% using Ray and Karpenter on Amazon Elastic Kubernetes Service (Amazon EKS). The company specializes in building advanced software and hardware to support various defense missions, including maritime vigilance, misinformation disruption, and nontraditional intelligence collection. Vannevar Labs uses ML to process information from ingestion systems and perform user-driven tasks such as search and summarization. With a diverse set of models, including fine-tuned open source and in-house trained models, Vannevar Labs embarked on a mission to optimize its ML inference workloads for improved deployment speed, scalability, and cost-efficiency.

This post explores their approach, challenges, and solutions implemented using Amazon EKS, Ray, and Karpenter, resulting in a 45% reduction in costs and significantly improved performance.

Overview

At Vannevar Labs, we implemented a comprehensive optimization strategy to address some of our key challenges in ML model deployment and performance. By adopting Ray Serve for standardized model serving, using Karpenter for opportunistic instance selection, and using fractional GPUs, we significantly improved scalability, elasticity, and resource usage. We also split a monolithic cluster into specialized model-specific clusters, enhancing deployment efficiency. To make sure of optimal performance, we implemented a robust monitoring system using Prometheus, Grafana, and Sentry. Furthermore, we used Istio for advanced traffic management, facilitating smooth transitions and load testing.

Solution implementation

Adopting Ray Serve for standardized inference

Ray Serve is a scalable model serving library for building online inference APIs. We chose Ray Serve to standardize the ML inference process, providing a more structured and efficient approach to handling inference requests. We used KubeRay to deploy Ray clusters on Amazon EKS.

Deploying Ray Serve on Kubernetes combines the scalable compute of Ray Serve with the operational benefits of Kubernetes. For more information on implementing Ray Serve on Kubernetes, see the Ray Serve “Deploy on Kubernetes” topic.
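The following is a minimal sketch of a KubeRay RayService manifest. The names, image tags, import path, and replica counts are illustrative placeholders rather than our production values; the point is that a single resource describes both the Serve application and the Ray cluster that runs it.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: embedding-service            # illustrative name
spec:
  serveConfigV2: |
    applications:
      - name: embedding
        import_path: embedding.app:deployment   # hypothetical import path
        deployments:
          - name: EmbeddingModel
            num_replicas: 2
  rayClusterConfig:
    rayVersion: "2.9.0"                          # example version
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0        # replace with a model-specific image
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 10
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-gpu
                resources:
                  limits:
                    nvidia.com/gpu: "1"

With a resource like this in place, KubeRay manages the Serve application's lifecycle, and Ray's scaling decisions surface as ordinary pending pods that the Kubernetes scheduler and Karpenter can act on.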

Optimizing instance selection with Karpenter

In the initial implementation, we created separate Karpenter NodePools for each type of model. Each NodePool was limited to a few Amazon Elastic Compute Cloud (Amazon EC2) instance types optimized for that usage profile. Models were assigned to NodePools using taints and tolerations. For example, CPU-intensive models ran in pods that tolerated the taint on the CPU NodePool, which only provisioned C5 EC2 instances rather than GPU-based EC2 instances.

After testing, we decided to move to a single NodePool strategy and dropped taints and tolerations as a mechanism for scheduling pods. Instead, we adopted a strategy of providing accurate and comprehensive information to the Ray, Kubernetes, and Karpenter schedulers, giving them the flexibility to optimize the EKS cluster for compute and cost efficiency.
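As an illustrative sketch of this single-NodePool approach (field names follow the Karpenter v1beta1 API, and the NodeClass name is a placeholder), the NodePool constrains only architecture and capacity type and leaves instance selection to the scheduler:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default                  # EC2NodeClass defined elsewhere
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        # No instance-type restriction: Karpenter chooses CPU or GPU instances
        # based on what the pending pods actually request.
  limits:
    cpu: "1000"
  disruption:
    consolidationPolicy: WhenUnderutilized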

For example, for pods that needed GPUs, we implemented node affinity to make sure that the correct GPU type was available. In our embedding model pod specifications, we added a node affinity on the well-known label karpenter.k8s.aws/instance-gpu-name to specify the GPU model, as shown in the following snippet, and allowed Karpenter to provision the right-sized EC2 instance.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-gpu-name
              operator: In
              values:
                - t4

For CPU-only pods, we specified that no GPU is required, allowing Kubernetes to schedule these pods on nodes with GPUs without granting access to the GPU. This approach allowed CPU-heavy workloads to run on GPU-based EC2 instances that were already provisioned but underused, maximizing resource usage on fewer EC2 instances.

resources:
  limits:
    nvidia.com/gpu: "0"

Using fractional GPUs

A key optimization was the use of fractional GPUs in Ray. By carefully analyzing usage patterns, we could bundle multiple Ray actors and tasks into Kubernetes pods, significantly improving GPU usage.

The embedding model calculates the corresponding vector for a given piece of text. It uses a fractional GPU, with previous-generation NVIDIA T4 GPUs providing sufficient performance for our purposes, as shown in the following configuration.

embedding: {
	cpus:            1.0
	gpus:            0.2
	memory:          2Gi
	minReplicas:     1
	maxReplicas:     80
	gpuType:         "t4"
}
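In Ray Serve terms, this maps to a deployment whose replicas each request a fraction of a GPU. The following is a hedged sketch of the equivalent Serve configuration; the application name and import path are placeholders:

applications:
  - name: embedding
    import_path: embedding.app:deployment    # hypothetical import path
    deployments:
      - name: EmbeddingModel
        autoscaling_config:
          min_replicas: 1
          max_replicas: 80
        ray_actor_options:
          num_cpus: 1.0
          num_gpus: 0.2          # five replicas can share one T4 GPU
          memory: 2147483648     # 2 GiB, expressed in bytes

Because each replica reserves 0.2 GPU, Ray can pack up to five replicas onto a node with a single T4 before another GPU node is needed.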

Tuning pod placement and GPU usage

We migrated three models from CPU to GPU-based inference to address performance and cost challenges. The transition delivered substantial improvements in inference speed and resource efficiency, reducing latency by up to 100x and decreasing the number of required replicas by 20-50x. Despite the higher cost of GPU-based EC2 instances, overall costs were lower because we could run fewer EC2 instances.

Initially, we had many issues converting the models to run on GPUs: memory errors, underusage on multi-GPU EC2 instances, and poor overall GPU usage. The following Kubernetes Node resource has been annotated with extended resources by the NVIDIA device plugin for Kubernetes. This allows us to request GPUs alongside CPU and memory and rely on the Kubernetes scheduler to run pods on nodes that have available GPUs, and it gives pods exclusive access to specific GPUs. Karpenter relies on a similar mechanism to launch GPU-based EC2 instances when there are pending pods that require them.

apiVersion: v1
kind: Node
metadata:
  labels:
    karpenter.k8s.aws/instance-gpu-count: '8'
    karpenter.k8s.aws/instance-gpu-manufacturer: nvidia
    karpenter.k8s.aws/instance-gpu-memory: '16384'
    karpenter.k8s.aws/instance-gpu-name: t4
    karpenter.sh/capacity-type: on-demand
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: g4dn.metal
status:
  allocatable:
    cpu: 95690m
    ephemeral-storage: '246305030964'
    hugepages-1Gi: '0'
    hugepages-2Mi: '0'
    memory: 387183588Ki
    nvidia.com/gpu: '8'
    pods: '737'
  capacity:
    cpu: '96'
    ephemeral-storage: 268423148Ki
    hugepages-1Gi: '0'
    hugepages-2Mi: '0'
    memory: 395848676Ki
    nvidia.com/gpu: '8'
    pods: '737'
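For contrast with the zero-GPU limit shown earlier, a container that needs exclusive use of a GPU requests the extended resource directly (a minimal illustrative snippet); the device plugin then assigns it a whole device:

resources:
  limits:
    nvidia.com/gpu: "1"   # the device plugin allocates one whole GPU to this container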

Splitting up the monolithic EKS cluster

With the new deployment infrastructure in place, we began creating specialized EKS clusters for each model. This included the creation of customized Docker images for each model instead of using a large all-in-one image. The specialized images ranged from 2 to 12 GB, as compared to the 25 GB monolithic image. This change significantly improved pod launch times and brought substantial cost savings, reducing loading time by 50-90% depending on the image.

A key benefit of using specialized images was the reduction in network ingress costs. The previous all-in-one image didn't include model weights or Python packages, requiring Ray to download these during container initialization. Although model weights were mostly stored in Amazon S3, incurring mainly time penalties, the Python package downloads had a larger impact. Each new Ray worker node launch triggered a pip install to create the required Python environment, resulting in significant data downloads from PyPI. These downloads accounted for a notable portion of the network ingress charges on our monthly AWS bill.

The specialized images included both the model weights and the necessary Python environment. This eliminated the need for downloads from the public internet during container initialization, significantly reducing network incoming traffic and associated costs.

This graph illustrates the dramatic reduction in incoming traffic from October 2023 to June 2024, highlighting the cost-saving impact of the specialized image approach.

Implementing comprehensive monitoring

We set up a monitoring system to measure performance and validate resource allocation. Grafana dashboards are automatically provisioned for each EKS cluster, providing monitoring and alerting capabilities. These dashboards included panels to visualize CPU and memory usage over time, allowing accurate determination of resource requirements. We set resource limits just above the observed maximum values and added a dashboard panel to track incidents where Ray deletes an actor for using too much memory. We also instrumented the model code to report 500 response codes and uncaught errors to Sentry. These measures gave us confidence that we'd be alerted when resource limits were set too low, which allowed us to fine-tune resource allocation for each model. This maintains good performance while avoiding over-provisioning and increases overall efficiency.
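As one example of how such an alert can be expressed (a sketch assuming the Prometheus Operator's PrometheusRule CRD and kube-state-metrics; the rule name and threshold are illustrative), we can fire when a model container approaches its memory limit:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: model-resource-alerts      # illustrative name
spec:
  groups:
    - name: model-resources
      rules:
        - alert: ModelMemoryNearLimit
          expr: |
            max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
              /
            max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Model container is above 90% of its memory limit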

Graph 1. CPU usage of a model over 24 hours

Graph 2. Memory usage of a model over 24 hours

Enhancing traffic management with Istio

We used an Istio VirtualService to route inference requests to the new Ray clusters. Although we split our monolithic Ray cluster into a separate cluster for each model, the VirtualService provided a unified gateway, making client migration simpler. Furthermore, traffic was configured to flow to the existing EKS cluster while also being mirrored to experimental EKS clusters for load testing.
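A hedged sketch of such a VirtualService is shown in the following snippet; the hostnames and service names are placeholders. Requests are routed to the existing service while a copy of the traffic is mirrored to an experimental deployment, whose responses are discarded:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-routes            # illustrative name
spec:
  hosts:
    - inference.internal.example.com
  gateways:
    - inference-gateway
  http:
    - route:
        - destination:
            host: embedding-serve-svc.production.svc.cluster.local
          weight: 100
      mirror:
        host: embedding-serve-svc.experimental.svc.cluster.local
      mirrorPercentage:
        value: 100.0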

Conclusion

By using Amazon EKS, Ray, Karpenter, and Istio, Vannevar Labs dramatically improved our ML inference infrastructure. We reduced deployment time from three hours to only six minutes, improved the ability to handle demand spikes, and achieved a 45% reduction in inference costs.

Key improvements:

  1. Reduced deployment time: from three hours to six minutes.
  2. Improved scalability: some worker groups now scale up in as little as two minutes.
  3. Enhanced elasticity: more frequent scaling down of worker groups leads to substantial cost savings during low-demand periods.
  4. Overall cost reduction: 45% reduction in inference costs.
  5. Better resource utilization: higher overall EKS cluster usage by running CPU workloads on GPU-based EC2 instances when GPU capacity was underused.
  6. Enhanced traffic management: improved traffic routing and load testing capabilities with Istio.

Our next steps include enhancing Kubernetes integration by working with the Ray project to expose scaling data more transparently to Kubernetes. We have also planned a second round of optimizations focusing on Mountpoint for Amazon S3 and other storage-related improvements to further enhance performance and reduce costs.

This case study demonstrates the power of combining modern cloud technologies such as Amazon EKS, Ray, Karpenter, and Istio to optimize ML inference workloads. By carefully analyzing requirements, using the right tools, and continuously iterating on the solution setup, organizations can achieve significant improvements in deployment speed, scalability, and cost-efficiency.