AWS Machine Learning Blog

Category: Amazon SageMaker HyperPod

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model

Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.

Solution overview

Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience

In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.

Generative AI foundation model training on Amazon SageMaker

Generative AI foundation model training on Amazon SageMaker

In this post, we explore how organizations can cost-effectively customize and adapt FMs using AWS managed services such as Amazon SageMaker training jobs and Amazon SageMaker HyperPod. We discuss how these powerful tools enable organizations to optimize compute resources and reduce the complexity of model training and fine-tuning. We explore how you can make an informed decision about which Amazon SageMaker service is most applicable to your business needs and requirements.

Scalable training platform with Amazon SageMaker HyperPod for innovation: a video generation case study

In this post, we share an ML infrastructure architecture that uses SageMaker HyperPod to support research team innovation in video generation. We will discuss the advantages and pain points addressed by SageMaker HyperPod, provide a step-by-step setup guide, and demonstrate how to run a video generation algorithm on the cluster.

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In this post, we present to you an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod.

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

Enabling production-grade generative AI: New capabilities lower costs, streamline production, and boost security

As generative AI moves from proofs of concept (POCs) to production, we’re seeing a massive shift in how businesses and consumers interact with data, information—and each other. In what we consider “Act 1” of the generative AI story, we saw previously unimaginable amounts of data and compute create models that showcase the power of generative […]

Scaling Thomson Reuters’ language model research with Amazon SageMaker HyperPod

In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.

Introducing Amazon EKS support in Amazon SageMaker HyperPod

Introducing Amazon EKS support in Amazon SageMaker HyperPod

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

Integrate HyperPod clusters with Active Directory for seamless multi-user login

Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption. Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, […]