AWS Machine Learning Blog
Category: Amazon Elastic Kubernetes Service
Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM
In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS. This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs.
Introducing Amazon EKS support in Amazon SageMaker HyperPod
This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.
Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS
In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing […]
Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container
Amazon Web Services is excited to announce the launch of the AWS Neuron Monitor container, an innovative tool designed to enhance the monitoring capabilities of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS). This solution simplifies the integration of advanced monitoring tools such as Prometheus and Grafana, enabling you to […]
Get started quickly with AWS Trainium and AWS Inferentia using AWS Neuron DLAMI and AWS Neuron DLC
Starting with the AWS Neuron 2.18 release, you can now launch Neuron DLAMIs (AWS Deep Learning AMIs) and Neuron DLCs (AWS Deep Learning Containers) with the latest released Neuron packages on the same day as the Neuron SDK release. When a Neuron SDK is released, you’ll now be notified of the support for Neuron DLAMIs […]
Scale AI training and inference for drug discovery through Amazon EKS and Karpenter
This is a guest post co-written with the leadership team of Iambic Therapeutics. Iambic Therapeutics is a drug discovery startup with a mission to create innovative AI-driven technologies to bring better medicines to cancer patients, faster. Our advanced generative and predictive artificial intelligence (AI) tools enable us to search the vast space of possible drug […]
Open source observability for AWS Inferentia nodes within Amazon EKS clusters
This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips, used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.
Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS – Part 2
This is a guest post co-written with Meta’s PyTorch team and is a continuation of Part 1 of this series, where we demonstrate the performance and ease of running PyTorch 2.0 on AWS. Machine learning (ML) research has proven that large language models (LLMs) trained with significantly large datasets result in better model quality. In […]
Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker
This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML. Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of […]
Enable pod-based GPU metrics in Amazon CloudWatch
This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.