Amazon FSx for Lustre customers

Adobe

Adobe was founded 40 years ago on the simple idea of creating innovative products that change the world. Adobe offers groundbreaking technology that empowers everyone, everywhere to imagine, create, and bring any digital experience to life.

Challenge: Rather than rely on open source models, Adobe decided to train its own foundational generative AI models tailored for creative use cases.

Solution: Adobe created an AI superhighway on AWS to build an AI training platform and data pipelines to rapidly iterate models. Adobe built its solution with Amazon Elastic Compute Cloud (Amazon EC2) P5 and P4d instances powered by NVIDIA GPUs, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Block Store (Amazon EBS), and Amazon Elastic Fabric Adapter (EFA). Adobe also used Amazon Simple Storage Service (Amazon S3) to serve as the data lake and primary repository for the vast troves of data. Adobe used Amazon FSx for Lustre high-performance file storage, for fast access to data and to make sure GPU resources are never left idle.

It's easy to think I'll create my own AI cloud, but the partnership with AWS lets us focus on our differentiators.

Alexandru Costin - Vice President, Generative AI and Sensei at Adobe

Read the Adobe Case study. »
LG AI Research

Together with world-leading AI experts, LG AI Research aims to lead the next epoch of AI and realize a promising future by providing an optimal research environment and leveraging state-of-the-art AI technologies.

Challenge: LG AI Research needed to deploy its foundation model, EXAONE, to production in one year. EXAONE, which stands for “expert AI for everyone,” is a 300-billion-parameter multi-modal model that uses both images and text data.

Solution: LG AI Research used Amazon SageMaker to train its large-scale foundation model and Amazon FSx for Lustre to distribute data to instances and accelerate model training. LG AI Research successfully deployed EXAONE in one year and reduced costs by approximately 35 percent by eliminating the need for a separate infrastructure management team.

Read the LG AI Research Case study. »
Paige

Paige is the leading digital pathology transformation provider, offering a full-scale, AI-enabled, web-based solution that brings efficiency and confidence to cancer diagnosis.

Challenge: Paige’s on-premises solutions were maxed out. Their goal was to train AI and ML models to help with cancer pathology. Paige discovered that the more compute capacity that they have, the faster they can train their models and help solve diagnostic problems.

Solution: To run their ML training workloads, Paige selected Amazon EC2 P4d Instances, powered by NVIDIA A100 Tensor Core GPUs, which deliver high performance for ML training and HPC applications in the cloud. Paige uses Amazon FSx for Lustre, fully managed shared storage built on a popular high-performance file system. The company connected this service with some of its Amazon S3 buckets, which helps its development teams address petabytes of ML input data without manually prestaging data on high-performance file systems. As a result, Paige can train on 10 times as much data as it could on premises using AWS infrastructure for ML. Paige also experienced 72% faster internal workflows with Amazon EC2 and Amazon FSx for Lustre.
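As a rough illustration of what connecting an FSx for Lustre file system to an S3 bucket can look like, the sketch below builds the parameters for the boto3 `create_data_repository_association` call. It is not Paige's actual configuration: the file system ID, bucket name, prefix, and mount path are hypothetical placeholders, and whether Paige used this mechanism or another linking method is an assumption.

```python
from typing import Any, Dict


def build_s3_link_request(
    file_system_id: str,
    bucket: str,
    prefix: str = "",
    mount_path: str = "/training-data",
) -> Dict[str, Any]:
    """Build keyword arguments that link an S3 prefix to a path on the file system.

    All argument values are illustrative; none come from the case study.
    """
    # Normalize to "s3://bucket/prefix" without a trailing slash.
    s3_path = f"s3://{bucket}/{prefix}".rstrip("/")
    return {
        "FileSystemId": file_system_id,
        "FileSystemPath": mount_path,
        "DataRepositoryPath": s3_path,
        # Import object metadata when the link is created, so files are
        # visible without copying their contents up front.
        "BatchImportMetaDataOnCreate": True,
        "S3": {
            # Keep the file system listing in sync as objects change in S3.
            "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        },
    }


def link_bucket(request: Dict[str, Any]) -> str:
    """Send the request to AWS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    client = boto3.client("fsx")
    response = client.create_data_repository_association(**request)
    return response["Association"]["AssociationId"]


if __name__ == "__main__":
    # Hypothetical identifiers, shown only to demonstrate the request shape.
    request = build_s3_link_request("fs-0123456789abcdef0", "example-bucket", "slides/")
    print(request)
```

Only `build_s3_link_request` runs without AWS access; `link_bucket` would create a real association and is left uncalled here.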

By connecting Amazon FSx for Lustre to Amazon S3, we can train on 10 times the amount of data that we have ever tried in the on-premises infrastructure without any trouble.

Alexander van Eck, staff AI engineer - Paige

Read the case study, Paige Furthers Cancer Treatment Using a Hybrid ML Workflow Built with Amazon EC2 P4d Instances. »
Toyota

Toyota Research Institute chooses FSx for Lustre to reduce object recognition machine learning training times.

Toyota Research Institute (TRI) collects and processes large amounts of sensor data from their autonomous vehicle (AV) test drives. Each training data set is staged in an on-premises NAS device and transferred to Amazon Simple Storage Service (Amazon S3) before processing on a powerful GPU compute cluster. TRI needed a high-performance file system to pair with their compute resources, speed up their ML model training, and accelerate insights for their data scientists.

We needed a parallel file system for our ML training data sets and chose Amazon FSx for Lustre for its higher availability and durability, compared to our legacy file system offering. The integration with AWS services, including S3, also made it the preferred option for our high performance file storage.

David Fluck, Software Engineer - Toyota Research Institute
Shell

Shell offers a dynamic portfolio of energy options, from oil, gas and petrochemicals to wind, solar and hydrogen. Shell is proud to supply the energy its customers need to power their lives.

Challenge: Shell relies on HPC for model building, testing, and validation. From 2020 to 2022, GPU utilization averaged less than 90%, resulting in project delays and limitations on new algorithm experimentation.

Solution: Shell augments their on-premises compute capacity by bursting to the cloud with Amazon EC2 clusters and Amazon FSx for Lustre. This solution gives Shell the capability to quickly scale up and down, and only purchase additional compute capacity when needed. Shell’s GPUs are now fully utilized, reducing the cost of compute and accelerating machine learning model testing.
Storengy

Storengy, a subsidiary of the ENGIE Group, is a leading supplier of natural gas. The company offers gas storage, geothermal solutions, carbon-free energy production, and storage technologies to enterprises worldwide.

To ensure its products are properly stored, Storengy uses high-tech simulators to evaluate underground gas storage, a process that requires extensive use of high-performance computing (HPC) workloads. The company also uses HPC technology to run natural gas discovery and exploration jobs.

Because of AWS, we have the scalability and high availability to perform hundreds of simulations at a time. Additionally, the solution scales automatically up or down to support our peak workload periods, which means we don’t have any surprises with our HPC environment.

Jean-Frederic Thebault – Engineer, Storengy
Smartronix

Smartronix leverages FSx for Lustre to deliver reliable high performance for their SAS Grid deployments.

Smartronix provides cloud solutions, cyber security, systems integration, worldwide C5ISR and data analytics, and mission-focused engineering for many of the world's leading commercial and federal organizations. Smartronix relied on SAS Grid to analyze and deliver state-wide COVID daily statistics, and found their self-managed, parallel file system difficult to administer and protect.

Collaborating with AWS and leveraging their managed solutions like FSx for Lustre has allowed us to serve our customers better – with higher availability and 29% lower cost than self-managed file systems.

Rob Mounier – Senior Solutions Architect, Smartronix
Netflix

Netflix is a streaming service that offers a wide variety of award-winning TV shows, movies, anime, documentaries, and more.

Challenge: Netflix uses large-scale distributed training for media ML models, for post-production thumbnails, VFX, and trailer generation for thousands of videos and millions of clips. Netflix was experiencing long waits due to cross-node replication and 40% GPU idle time.

Solution: Netflix re-architected their data loading pipeline and improved its efficiency by pre-computing all video/audio clips. Netflix also chose Amazon UltraClusters (EC2 P4d instances) to accelerate compute performance. Amazon FSx for Lustre performance enables Netflix to saturate GPUs and virtually eliminate GPU idle time. Netflix now experiences a 3-4x improvement using pre-compute and FSx for Lustre, reducing model training time from a week to 1-2 days.

Watch the video: Large-scale distributed training of media ML models with Amazon FSx for Lustre. »
Hyundai

Hyundai Motor Company has risen as a globally recognized automobile manufacturer that exports its branded vehicles to over 200 countries.

Challenge: One of the algorithms often used in autonomous driving is semantic segmentation, which is a task to annotate every pixel of an image with an object class. These classes could be road, person, car, building, vegetation, sky, etc. Hyundai tests accuracy, and gathers additional images to correct for the insufficient predictive performance in specific situations. This can be a challenge, however, as there is often not enough time to prepare all the new data while leaving sufficient time to train the model and meet the scheduled deadlines.

Solution: Hyundai selected Amazon SageMaker to automate model training, and the Amazon SageMaker library for data parallelism to move from a single GPU to distributed training. They chose Amazon FSx for Lustre to train models without waiting for data copies. They also chose Amazon S3 for their permanent data storage. Hyundai achieved up to 93% scaling efficiency with 8 GPU instances, or 64 GPUs in total. FSx for Lustre enabled Hyundai to run multiple training jobs and experiments against the same data with zero wait time.
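To put the scaling-efficiency figure in context, here is a small sketch of the arithmetic. It assumes the common definition of scaling efficiency (measured speedup divided by the number of GPUs, relative to a single-GPU baseline). The source does not state how Hyundai measured it, and the function names are illustrative.

```python
def scaling_efficiency(single_gpu_time: float, multi_gpu_time: float, num_gpus: int) -> float:
    """Scaling efficiency = (speedup over one GPU) / (number of GPUs)."""
    if num_gpus < 1 or single_gpu_time <= 0 or multi_gpu_time <= 0:
        raise ValueError("times must be positive and num_gpus at least 1")
    speedup = single_gpu_time / multi_gpu_time
    return speedup / num_gpus


def effective_speedup(num_gpus: int, efficiency: float) -> float:
    """Invert the definition: speedup implied by a given efficiency."""
    if num_gpus < 1 or not 0.0 < efficiency <= 1.0:
        raise ValueError("num_gpus must be >= 1 and efficiency in (0, 1]")
    return num_gpus * efficiency


if __name__ == "__main__":
    total_gpus = 64  # 8 instances with 64 GPUs in total, as stated in the text
    speedup = effective_speedup(total_gpus, 0.93)
    print(f"Implied speedup over a single GPU: {speedup:.1f}x of an ideal {total_gpus}x")
```

Under that assumed definition, 93% efficiency on 64 GPUs corresponds to roughly a 59.5x speedup over one GPU, compared with an ideal 64x.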

Read the customer blog post »
Rivian

Rivian is on a mission to keep the world adventurous forever. We believe there is a more responsible way to explore the world and are determined to make the transition to sustainable transportation an exciting one.

To meet accelerated engineering schedules and reduce the need for physical prototypes, electric vehicle manufacturer Rivian relies on advanced modeling and simulation techniques. Using high compute capacity, simulations enable engineers to test new concepts and bring their designs to market quickly.

Partnering with Amazon lets Rivian focus on sustainable vehicle development and delivery, not on IT. And with Amazon, we are running our key development applications faster than on premises including: 56% faster on Elements, 35% faster on Siemens and 20% faster on Ansys.

Madhavi Osanaka, CIO - Rivian

Read the Rivian case study »
DENSO

DENSO develops image sensors for advanced driver-assistance systems (ADAS), which help drivers with functions such as parking and changing lanes.

Challenge: To develop the necessary ML models for ADAS image recognition, DENSO had built GPU clusters in its on-premises environment. However, multiple ML engineers shared limited GPU resources, which impacted productivity—especially during the busy period before a new product release.

Solution: By adopting Amazon SageMaker and Amazon FSx for Lustre, DENSO was able to accelerate the creation of ADAS image recognition models by reducing the data acquisition, model development, learning, and evaluation time.

“The practice of shifting to the cloud will keep accelerating in the artificial intelligence and ML field. I’m confident that AWS will continue to give us support as we continue adding functions.”

Kensuke Yokoi, general manager - DENSO

Read the Denso Case Study. »
Joby Aviation

Joby Aviation uses AWS to revolutionize transportation.

Challenge: Joby engineers rely on high performance computing (HPC) to conduct thousands of complex, compute-intensive Computational Fluid Dynamics (CFD) simulations that use hundreds of CPU cores each and can take many hours to complete.

Solution: Using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon FSx for Lustre enabled Joby to get faster results from their CFD workloads compared to on-premises high-performance computing infrastructure.

When we tried to run dozens of simulations at one time, we were reading and writing several gigabytes of data at a time, which slowed everything down. FSx for Lustre eliminated those capacity problems. We can increase the size of our hard drive easily now.

Alex Stoll, Aeromechanics Lead, Joby Aviation

Read the Joby Aviation case study »
T-Mobile

T-Mobile realizes $1.5M in annual savings and doubles the speed of SAS Grid workloads using Amazon FSx for Lustre.

Challenge: T-Mobile was experiencing high management overhead and performance difficulties with their self-managed SAS Grid workload.

Solution: T-Mobile deployed Amazon FSx for Lustre, a fully managed high-performance file system, to migrate and scale their SAS Grid infrastructure. T-Mobile utilized the tight integration of Amazon FSx and S3 to reduce their storage overhead and optimize operations.

Amazon FSx for Lustre helped us double the speed of our SAS Grid workloads, reduce our Total Cost of Ownership by 83% and completely eliminate our operational burden. Partnering with AWS enables us to focus on what we do best, developing innovative products for our customers, while relying on the cutting-edge storage features of FSx, and world-class hosting capabilities of AWS.

Dinesh Korde, Senior Manager Software Development - T-Mobile
Netflix

Production of the fourth season of Netflix’s episodic drama “The Crown” faced unexpected challenges, as the world went into lockdown for the COVID-19 pandemic just as post-production VFX work was slated to begin. By adopting a cloud-based workflow on AWS, including an Amazon FSx for Lustre file system for enhanced throughput, Netflix’s in-house VFX team of 10 artists was able to seamlessly complete more than 600 VFX shots for the season’s 10-episode run in just 8 months, all while working remotely.

Read the "'The Crown' in the Cloud" blog post »
Maxar

Maxar uses AWS to deliver forecasts 58% faster than its weather supercomputer.

Challenge: Maxar Technologies, a trusted partner and innovator in Earth intelligence and space infrastructure, needed to deliver weather forecasts faster than its on-premises supercomputer could.

Solution: Maxar worked with AWS to create an HPC solution with key technologies including Amazon Elastic Compute Cloud (Amazon EC2) for secure, highly reliable compute resources, Amazon FSx for Lustre to accelerate the read/write throughput of its application, and AWS ParallelCluster to quickly build HPC compute environments on AWS.

Maxar used Amazon FSx for Lustre in our AWS HPC solution for running NOAA's numerical weather forecasting model. This allowed us to reduce compute time by 58%, generating the forecast in about 45 minutes for a much more cost-effective price point. Maximizing our AWS compute resources was an incredible performance boost for us.

Stefan Cecelski, PhD, Senior Data Scientist & Engineer - Maxar Technologies

Read the Maxar case study »
INEOS TEAM UK

INEOS TEAM UK accelerates boat design for America’s Cup using AWS.

Challenge: Formed in 2018, INEOS TEAM UK aims to bring the America’s Cup—the oldest international sporting trophy in the world—to Great Britain. America’s Cup restricts on-water testing to no more than 150 days prior to the event, so high-performance computational fluid dynamics (CFD) simulations of monohulls and foils become key to a winning boat design.

Solution: Using AWS, INEOS TEAM UK can process thousands of design simulations for its America’s Cup boat in one week versus in more than a month using an on-premises environment. INEOS TEAM UK competed in the 36th edition of the America’s Cup in 2021. The team is using an HPC environment running on Amazon EC2 Spot Instances. To ensure fast disk performance for the thousands of simulations completed each week, the team also used Amazon FSx for Lustre to provide a fast, scalable, and secure high-performance file system based on Amazon Simple Storage Service (S3).

AWS allows us to take bigger design steps, simply because we have more time to understand our results.

Nick Holroyd, Head of Design - INEOS TEAM UK

Read the INEOS TEAM UK case study »
Hive VFX

Hive VFX cuts upfront studio costs, operates as cloud VFX studio on AWS.

Challenge: Hive needed high-performance infrastructure to launch a small, independent cloud studio for remote artists around the world to create quality content.

Solution: Fully managed Amazon FSx for Lustre, integrated with Amazon S3, provided fast access to AWS compute resources without a big upfront investment or in-house IT team expertise. The seamless synchronization of file data and file permissions between FSx for Lustre and S3 enabled Hive VFX to store a large volume of images and share project data across continents.

I can spin up an Amazon FSx for Lustre file system in 5 minutes and it's all managed by AWS.

Bernie Kimbacher, Founder - Hive VFX

Read the Hive VFX case study »
Lyell

Lyell accelerates their cell-based cancer treatment research with Amazon FSx for Lustre.

Challenge: Lyell delivers curative, cell-based cancer treatments that require running large-scale computational design of proteins. These workloads were traditionally run on premises, but the company needed a more scalable, cost-effective solution as they were limited to running only one experiment per month.

Solution: Since migrating their file system to FSx for Lustre, data scientists can spin up and spin down thousands of HPC clusters consisting of EC2 instances and Amazon FSx file systems, enabling them to run processing-heavy experiments rapidly, and only pay for compute and storage for the duration of the workload.

Amazon FSx for Lustre speeds up our research in developing the next generation cancer treatment. Using FSx, we have reduced the execution time of our experiments from weeks down to hours, and enabled scientists to test many more hypotheses than before. Our workloads running on tens of thousands of compute nodes can now use FSx to access S3 data at super-high speeds.

Anish Kejariwal, Head of Data Analytics Engineering - Lyell Immunopharma
BlackThorn Therapeutics

BlackThorn Therapeutics accelerates time-to-insight with FSx for Lustre.

Challenge: Processing magnetic resonance imaging (MRI) data using standard DIY cloud file systems was resource- and time-intensive. BlackThorn needed a compute-intensive, shared file storage solution to help simplify their data science and machine learning workflows.

Solution: Amazon FSx for Lustre is integrated with Amazon S3 and Amazon SageMaker, providing fast processing for their ML training data sets as well as seamless access to compute using Amazon EC2 instances.

FSx for Lustre has enabled us to create a high-performance MRI data processing pipeline. Data processing time for our ML-based workflows was cut down to minutes compared to days and weeks.

Oscar Rodriguez, Senior Director, Innovation & Technology - BlackThorn Therapeutics
Qubole

Qubole improves data durability while reducing cost with Amazon FSx for Lustre.

Challenge: Qubole was seeking a high-performance storage solution to process analytical and AI/ML workloads for their customers. They needed to easily store and process the intermediate data held in their EC2 Spot Fleet.

Solution: Qubole used Amazon FSx for Lustre to store and process intermediate data through its parallel, high-speed file system.

Our users’ two biggest problems, high costs and intermediate data loss, stemmed from using idle EC2 instances and EC2 Spot instances to process and store intermediate data generated by distributed processing frameworks like Hive and Spark. We were able to solve this problem by using Amazon FSx for Lustre, a highly performant file system, to offload intermediate data. Now our users do not have to pay to maintain idle instances and are not affected by interrupted EC2 Spot nodes. Amazon FSx helped our users reduce total costs by 30%.

Joydeep Sen Sarma, CTO - Qubole