Amazon SageMaker HyperPod

Scale and accelerate generative AI model development across thousands of AI accelerators

What is SageMaker HyperPod?

Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building generative AI models. It helps you quickly scale model development tasks such as training, fine-tuning, and inference across a cluster of hundreds or thousands of AI accelerators. SageMaker HyperPod also enables centralized governance across all of your model development tasks, giving you full visibility and control over how tasks are prioritized and how compute resources are allocated to each one, so you can maximize the GPU and AWS Trainium utilization of your cluster and accelerate innovation.
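To make this concrete, here is a minimal sketch of creating a HyperPod cluster with the AWS SDK for Python (boto3). The cluster name, instance group layout, S3 lifecycle-script location, and IAM role ARN are placeholders, and the parameter names reflect the SageMaker CreateCluster API as we understand it; verify them against the current API reference.

```python
# Minimal sketch: creating a SageMaker HyperPod cluster with boto3.
# Cluster name, S3 path, and role ARN are illustrative placeholders; verify
# parameter names against the current CreateCluster API reference.
import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-east-1")

response = sagemaker_client.create_cluster(
    ClusterName="demo-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",   # GPU nodes; Trainium (ml.trn1) instance types also work
            "InstanceCount": 4,
            "LifeCycleConfig": {
                # Lifecycle scripts bootstrap each node when it joins the cluster.
                "SourceS3Uri": "s3://my-bucket/hyperpod/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
        }
    ],
)
print("Cluster ARN:", response["ClusterArn"])
```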

With SageMaker HyperPod, you can efficiently distribute and parallelize your training workload across all accelerators. SageMaker HyperPod automatically applies the best training configurations for popular publicly available models to help you quickly achieve optimal performance. It also continually monitors your cluster for infrastructure faults, automatically repairs them, and recovers your workloads without human intervention, which can save you up to 40% of training time.
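To illustrate the kind of workload that gets distributed, the sketch below is a generic PyTorch DistributedDataParallel training loop launched with torchrun, one process per accelerator. It is not HyperPod's own distributed training library, which adds further sharding and topology awareness; the model and training loop are stand-ins.

```python
# Minimal sketch of a distributed training script you might run on a HyperPod
# cluster: plain PyTorch DistributedDataParallel with one process per GPU.
# The model and training loop are stand-ins, not a HyperPod-specific API.
# Launch per node, e.g.: torchrun --nnodes=<N> --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")               # rendezvous across all ranks
    local_rank = int(os.environ["LOCAL_RANK"])             # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                                 # stand-in training loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                                      # gradients all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```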

Benefits of SageMaker HyperPod

SageMaker HyperPod task governance provides full visibility and control over compute resource allocation across generative AI model development tasks, such as training and inference. SageMaker HyperPod automatically manages task queues, ensuring that the most critical tasks are prioritized and completed on time and within budget, while using compute resources more efficiently to reduce model development costs by up to 40%.
With SageMaker HyperPod recipes, data scientists and developers of all skill levels get state-of-the-art performance and can start training and fine-tuning publicly available generative AI models in minutes (see the launch sketch below). HyperPod also provides built-in experimentation and observability tools to help you enhance model performance.
SageMaker HyperPod automatically splits your models and training datasets across AWS cluster instances to help you efficiently scale training workloads, and it optimizes your training job for the AWS network infrastructure and cluster topology. It also streamlines model checkpointing through the recipes by optimizing how frequently checkpoints are saved, keeping training overhead to a minimum.
SageMaker HyperPod provides a resilient environment for model development by automatically detecting, diagnosing, and recovering from infrastructure faults, allowing you to continually run model development workloads for months without disruption.
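As a rough illustration of launching a recipe, the sketch below uses the SageMaker Python SDK's PyTorch estimator. The training_recipe and recipe_overrides parameters, the recipe identifier, the override keys, and the role ARN are assumptions to verify against the current SDK and recipe documentation; recipes can also be launched directly on a HyperPod cluster through the recipe launcher.

```python
# Rough sketch: launching a recipe as a SageMaker training job via the Python
# SDK. The recipe identifier, override keys, role ARN, and the
# training_recipe/recipe_overrides parameters are assumptions; check the
# current SDK and recipe documentation before use.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

estimator = PyTorch(
    base_job_name="llama3-8b-pretrain",
    role=role,
    instance_type="ml.p5.48xlarge",
    instance_count=4,
    sagemaker_session=session,
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",  # assumed identifier
    recipe_overrides={
        "trainer": {"max_steps": 100},   # assumed override path; shortens the run for a smoke test
    },
)

estimator.fit()  # starts the distributed training job defined by the recipe
```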

Introducing task governance in SageMaker HyperPod

Maximize utilization and gain full visibility of compute resources, all while reducing costs.
