AWS Machine Learning Blog

Category: AWS Inferentia

Generative AI roadshow in North America with AWS and Hugging Face

In 2023, AWS announced an expanded collaboration with Hugging Face to accelerate our customers’ generative artificial intelligence (AI) journey. Hugging Face, founded in 2016, is the premier AI platform with over 500,000 open source models and more than 100,000 datasets. Over the past year, we have partnered to make it effortless to train, fine-tune, and […]

Gradient makes LLM benchmarking cost-effective and effortless with AWS Inferentia

This is a guest post co-written with Michael Feil at Gradient. Evaluating the performance of large language models (LLMs) is an important step of the pre-training and fine-tuning process before deployment. The faster and more frequent you’re able to validate performance, the higher the chances you’ll be able to improve the performance of the model. […]

Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

Today, we’re excited to announce the availability of Llama 2 inference and fine-tuning support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Using AWS Trainium and Inferentia based instances, through SageMaker, can help users lower fine-tuning costs by up to 50%, and lower deployment costs by 4.7x, while lowering per token latency. […]

Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with AWS Inferentia2

In this post, we showcase fine-tuning a Llama 2 model using a Parameter-Efficient Fine-Tuning (PEFT) method and deploy the fine-tuned model on AWS Inferentia2. We use the AWS Neuron software development kit (SDK) to access the AWS Inferentia2 device and benefit from its high performance. We then use a large model inference container powered by […]

Intuitivo achieves higher throughput while saving on AI/ML costs using AWS Inferentia and PyTorch

This is a guest post by Jose Benitez, Founder and Director of AI and Mattias Ponchon, Head of Infrastructure at Intuitivo. Intuitivo, a pioneer in retail innovation, is revolutionizing shopping with its cloud-based AI and machine learning (AI/ML) transactional processing system. This groundbreaking technology enables us to operate millions of autonomous points of purchase (A-POPs) […]

Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2

Generative AI models have been experiencing rapid growth in recent months due to its impressive capabilities in creating realistic text, images, code, and audio. Among these models, Stable Diffusion models stand out for their unique strength in creating high-quality images based on text prompts. Stable Diffusion can generate a wide variety of high-quality images, including […]

Optimize AWS Inferentia utilization with FastAPI and PyTorch models on Amazon EC2 Inf1 & Inf2 instances

When deploying Deep Learning models at scale, it is crucial to effectively utilize the underlying hardware to maximize performance and cost benefits. For production workloads requiring high throughput and low latency, the selection of the Amazon Elastic Compute Cloud (EC2) instance, model serving stack, and deployment architecture is very important. Inefficient architecture can lead to […]

AWS Inferentia2 builds on AWS Inferentia1 by delivering 4x higher throughput and 10x lower latency

The size of the machine learning (ML) models––large language models (LLMs) and foundation models (FMs)––is growing fast year-over-year, and these models need faster and more powerful accelerators, especially for generative AI. AWS Inferentia2 was designed from the ground up to deliver higher performance while lowering the cost of LLMs and generative AI inference. In this […]

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

The world of artificial intelligence (AI) and machine learning (ML) has been witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Compared to classical ML models, generative AI models are significantly bigger and more complex. However, their increasing complexity also comes with high costs […]

How to extend the functionality of AWS Trainium with custom operators

Deep learning (DL) is a fast-evolving field, and practitioners are constantly innovating DL models and inventing ways to speed them up. Custom operators are one of the mechanisms developers use to push the boundaries of DL innovation by extending the functionality of existing machine learning (ML) frameworks such as PyTorch. In general, an operator describes […]