Why Trainium?
AWS Trainium chips are a family of AI chips purpose-built by AWS for AI training and inference, designed to deliver high performance while reducing costs.
The first-generation AWS Trainium chip powers Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances, which offer up to 50% lower training costs than comparable Amazon EC2 instances. Many customers, including Databricks, Ricoh, NinjaTech AI, and Arcee AI, are realizing the performance and cost benefits of Trn1 instances.
The AWS Trainium2 chip delivers up to 4x the performance of first-generation Trainium. Trainium2-based Amazon EC2 Trn2 instances are purpose-built for generative AI and are the most powerful EC2 instances for training and deploying models with hundreds of billions to over a trillion parameters. Trn2 instances offer 30-40% better price performance than the current generation of GPU-based EC2 P5e and P5en instances. Each Trn2 instance features 16 Trainium2 chips interconnected with NeuronLink, our proprietary chip-to-chip interconnect. You can use Trn2 instances to train and deploy the most demanding models, including large language models (LLMs), multi-modal models, and diffusion transformers, to build a broad set of next-generation generative AI applications.

Trn2 UltraServers, a completely new EC2 offering (available in preview), are ideal for the largest models, which require more memory and memory bandwidth than standalone EC2 instances can provide. The UltraServer design uses NeuronLink to connect 64 Trainium2 chips across four Trn2 instances into one node, unlocking new capabilities. For inference, UltraServers help deliver industry-leading response times to create the best real-time experiences. For training, UltraServers boost model training speed and efficiency through faster collective communication for model parallelism compared to standalone instances.
You can get started training and deploying models on Trn2 and Trn1 instances with native support for popular machine learning (ML) frameworks such as PyTorch and JAX.