Amazon SageMaker HyperPod customers

Top AI startups and organizations of all sizes are training and deploying foundation models at scale on SageMaker HyperPod
  • Hugging Face

    Hugging Face has been using SageMaker HyperPod to create important new open foundation models such as StarCoder, IDEFICS, and Zephyr, which have been downloaded millions of times. SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than on managing infrastructure. We especially liked how SageMaker HyperPod detects ML hardware failures and quickly replaces the faulty hardware without disrupting ongoing model training. Because our teams need to innovate quickly, this automated job recovery feature helped us minimize disruption during the foundation model training process, saving us hundreds of hours of training time in just a year.

    Jeff Boudier, Head of Product at Hugging Face
  • Perplexity AI

    We were looking for the right ML infrastructure to increase productivity and reduce costs while building high-performing large language models. After running a few successful experiments, we switched to AWS from other cloud providers to use Amazon SageMaker HyperPod. For the last four months, we have been using HyperPod to build and fine-tune the LLMs that power the Perplexity conversational answer engine, which answers questions and provides references in the form of citations. Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time managing and optimizing the underlying infrastructure. SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast, which means our developers can iterate more quickly, accelerating the development of new generative AI experiences for our customers.

    Aravind Srinivas, co-founder and CEO at Perplexity AI
  • Articul8 AI

    Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources efficiently, with minimal downtime. We were early adopters of the Slurm-based HyperPod service and have benefited from its ease of use and resiliency features, resulting in up to 35% productivity improvement and rapid scale-up of our GenAI operations. As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us, as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. It also helps our end customers, since we can now package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.

    Arun Subramaniyan, Founder and CEO of Articul8 AI
  • Thomson Reuters

    Thomson Reuters, a global AI and content-driven technology company, has been testing the task governance capability in Amazon SageMaker HyperPod to address a key challenge around workload prioritization. With task governance, they can now run customer workloads such as inference requests alongside their ongoing model development projects, ensuring that urgent customer requests are prioritized without disrupting internal research, which leads to better resource utilization and customer satisfaction.

  • Thomson Reuters

    Thomson Reuters has been at the forefront of AI development for over 30 years, and we are committed to providing meaningful solutions that help our customers deliver results faster, with better access to trusted information. To accelerate our innovation in generative AI, in addition to partnering with LLM providers, we are also exploring training custom models more efficiently with our unique, proprietary content and human expertise. SageMaker HyperPod’s distributed training libraries help us improve large-scale model training performance, and its resiliency features save us time monitoring and managing infrastructure. Training our foundation models on SageMaker HyperPod will increase our speed to market and help us provide quality solutions for our customers at pace.

    Joel Hron, Head of AI and Labs, Thomson Reuters
  • Thomson Reuters

    We were able to meet our large language model training requirements using Amazon SageMaker HyperPod. Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock the benefits of LLMs in areas such as legal summarization and classification.

    John Duprey, Distinguished Engineer, Thomson Reuters Labs
  • Stability AI

    As the leading open-source generative AI company, our goal is to maximize the accessibility of modern AI. We are building foundation models with tens of billions of parameters, which require infrastructure that can scale optimized training performance. With SageMaker HyperPod’s managed infrastructure and optimization libraries, we can reduce training time and costs by over 50%. It makes our model training more resilient and performant, so we can build state-of-the-art models faster.

    Emad Mostaque, Founder and CEO, Stability AI
  • Observea

    For a fast-moving startup and AI research company like ours, Amazon EKS support in SageMaker HyperPod has been instrumental in accelerating our time to market. With SageMaker HyperPod, we have been able to launch a stable and secure platform that offers containerized high-performance computing (HPC) applications as a service to our end customers, which include top university AI research programs, AI startups, and traditional enterprises. Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With EKS support in SageMaker HyperPod, we can reduce the time spent on undifferentiated infrastructure management and cut operational costs by over 30%.

    Vamsi Pandari, Founder of Observea
  • Recursal AI

    The whole process was streamlined. Using SageMaker HyperPod, we can take advantage of cluster resiliency features that identify and automatically recover training jobs from the last saved checkpoint in the event of a hardware failure. We run very diverse workloads, from application serving to inference and training, with Kubernetes as the common thread. For us, Amazon EKS with SageMaker HyperPod just works: the nodes just drop into our cluster.

    Nathan Wilce, Infrastructure/data lead, Recursal
  • Hippocratic AI

    Hippocratic AI is an AI company developing the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and its supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. Amazon SageMaker HyperPod flexible training plans made it easier for the company to gain access to Amazon Elastic Compute Cloud (Amazon EC2) P5 instances. Hippocratic AI is also leveraging tools such as Grafana to track important GPU utilization metrics. Using Amazon EC2 P5 instances, Hippocratic AI has increased model training speed by four times and scales its solution to accommodate hundreds of use cases, helping it secure the required compute resources and train models quickly.

  • Articul8

    Amazon SageMaker HyperPod task governance helps maximize GPU utilization across teams and projects. As a fast-growing GenAI startup, Articul8 AI constantly optimizes its compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, the company has seen a dramatic improvement in GPU utilization, reducing idle time and accelerating model development across tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased the team's productivity, allowing them to bring new GenAI innovations to market faster than ever before.

  • NinjaTech

    NinjaTech AI, a generative AI company that provides an all-in-one SuperAgent for unlimited productivity, used Amazon SageMaker HyperPod flexible training plans to accelerate and automate fine-tuning of various internal models, including the Llama 3.1 405B model, while reducing model training costs. The company aims to provide a seamless experience to users who want access to the various AI agents powering its SuperAgent technology. To achieve this, it needed a model that could automatically predict user intention and determine which AI agent would be a good fit. This mechanism required frequent model updates that incorporate customer feedback and new features iteratively, involving 10 million to 100 million tokens in each round of LoRA fine-tuning (minimal sketches of the checkpoint-resume pattern and of such a LoRA configuration follow the customer list below). As a startup, acquiring and operating high-performance compute resources is challenging due to their steep cost and bandwidth constraints, especially in multi-node clusters, which require fast networking and fast storage in addition to accelerated computing. The training process is also time-consuming, involving steps like model downloading, distributed training, checkpointing, monitoring, auto-remediation, merging, and quantization. HyperPod flexible training plans provided the company with reliable and affordable compute in advance of the training run, matching its specific compute and timeline requirements while ensuring efficient model training.

  • OpenBabylon

    Developers and data scientists at OpenBabylon, an AI company that customizes large language models for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources for large-scale experiments. Using SageMaker HyperPod’s multi-node distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved cost-effectively, demonstrating SageMaker HyperPod’s ability to deliver complex projects on time and on budget.

  • Salesforce

    Researchers at Salesforce were looking for ways to quickly get started with foundation model training and fine-tuning without having to worry about infrastructure or spend weeks optimizing their training stack for each new model. With Amazon SageMaker HyperPod recipes, researchers at Salesforce can prototype rapidly when customizing FMs. Now, Salesforce’s AI Research teams can get started in minutes with a variety of pre-training and fine-tuning recipes and can operationalize frontier models with high performance.
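
Several customers above (Recursal, NinjaTech) credit HyperPod’s ability to resume training jobs from the last saved checkpoint after a hardware failure. The sketch below shows that general pattern in plain PyTorch; it is illustrative only, and the directory path, file names, and toy training loop are assumptions, not HyperPod’s actual implementation.

```python
import glob
import os

import torch
from torch import nn, optim

CKPT_DIR = "/checkpoints"  # hypothetical shared-storage path; HyperPod's real layout differs
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(step, model, optimizer):
    # Persist everything needed to resume: weights, optimizer state, and the step counter.
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        os.path.join(CKPT_DIR, f"ckpt-{step:08d}.pt"),
    )

def load_latest_checkpoint(model, optimizer):
    # After a faulty node is replaced, resume from the newest checkpoint on shared storage.
    paths = sorted(glob.glob(os.path.join(CKPT_DIR, "ckpt-*.pt")))
    if not paths:
        return 0  # fresh start: no checkpoint exists yet
    state = torch.load(paths[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

model = nn.Linear(512, 512)  # stand-in for a real foundation model
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
start_step = load_latest_checkpoint(model, optimizer)

for step in range(start_step, 10_000):
    batch = torch.randn(32, 512)       # stand-in for real training data
    loss = model(batch).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        save_checkpoint(step, model, optimizer)
```

Because the zero-padded step number makes filenames sort lexicographically, the most recent checkpoint is always the last glob result, so a restarted job loses at most the work since the previous save.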

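NinjaTech’s workflow above centers on repeated rounds of LoRA fine-tuning. The snippet below is a minimal sketch of wiring a LoRA adapter onto a causal language model with the Hugging Face peft library; the base model name, rank, and target modules are illustrative assumptions, not NinjaTech’s actual configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base model; NinjaTech fine-tunes internal models such as Llama 3.1 405B.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora = LoraConfig(
    r=16,                                 # adapter rank: tiny relative to the hidden size
    lora_alpha=32,                        # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base parameters

# ... each feedback round trains only the adapter weights on the new tokens ...

# After a round, the adapter can be merged back into the base weights for serving.
merged = model.merge_and_unload()
```

Training only the low-rank adapter is what keeps each 10M–100M-token round tractable: the frozen base weights are reused across rounds, and only the small merged delta changes per iteration.
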
Amazon SageMaker HyperPod partners

Drive innovation and unlock greater business value with AWS partners that have deep technical knowledge and proven customer success

  • Accenture

    We are extending our partnership with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Our collaboration with AWS will allow us to guide customers toward the latest technological breakthroughs while helping them reduce generative AI application costs. By bringing together the centralized governance capabilities in SageMaker HyperPod and our experience in generative AI projects, we can help companies realize the value of generative AI even faster, improving customer experience and increasing return on investment.

    Jennifer Jackson, Global Lead for Accenture AWS Business Group & Senior Managing Director
  • Slalom

    We are thrilled to collaborate with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Working with AWS, we can now help our customers rapidly adopt the latest technological advancements and reduce the costs of their generative AI applications. By bringing together the centralized governance capabilities in SageMaker HyperPod and Slalom’s extensive AI and cloud experience, we can deliver exceptional customer experiences alongside increased return on investment.

    Jeff Kempiners, Managing Director of Slalom’s Amazon Center of Excellence (CoE)
  • Rackspace Technology

    We are excited to collaborate with AWS as a launch partner for SageMaker HyperPod task governance. Together, we can help our customers reduce the costs of generative AI applications, while keeping up with the latest technological advancements. By combining SageMaker HyperPod’s centralized governance capabilities with Rackspace’s deep AI and cloud expertise, we can transform customer experiences and improve their return on investment simultaneously.

    Srini Koushik, President, AI, Technology and Sustainability at Rackspace Technology