What is SageMaker HyperPod?
Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building generative AI models. It helps you quickly scale model development tasks such as training, fine-tuning, and inference across a cluster of hundreds or thousands of AI accelerators. SageMaker HyperPod also provides centralized governance across all your model development tasks, giving you full visibility and control over how tasks are prioritized and how compute resources are allocated to each one. This helps you maximize GPU and AWS Trainium utilization in your cluster and accelerate innovation.
With SageMaker HyperPod, you can efficiently distribute and parallelize your training workload across all accelerators. SageMaker HyperPod automatically applies the best training configurations for popular publicly available models to help you quickly achieve optimal performance. It also continually monitors your cluster for infrastructure faults, automatically repairs them, and recovers your workloads without human intervention. Together, these capabilities help save you up to 40% of training time.
Benefits of SageMaker HyperPod
Introducing task governance in SageMaker HyperPod
Maximize utilization and gain full visibility of compute resources, all while reducing costs.