AWS HPC Blog
Tag: ML
EFA: how fixing one thing, led to an improvement for … everyone
Today, we’re diving deep into the open-source frameworks that move MPI messages around, and showing you how work we did in the Open MPI and libfabrics community lead to an improvement for EFA users – and everyone else, too.
Conceptual design using generative AI and CFD simulations on AWS
In this post we’ll show how generative AI, combined with conventional physics-based CFD can create a rapid design process to explore new design concepts in automotive and aerospace from just a single image.
How Amazon’s Search M5 team optimizes compute resources and cost with fair-share scheduling on AWS Batch
In this post, we share how Amazon Search optimizes their use of accelerated compute resources using AWS Batch fair-share scheduling to schedule distributed deep learning workloads.
How computer vision is enabling a circular economy
In this post, we show how Reezocar uses computer vision to change the way they detect damage and price used vehicles for re-sale in secondary markets. This reduces landfill and helps achieve the goals of the circular economy.
Improving NFL player health using machine learning with AWS Batch
In this post we’ll show you how the NFL used AWS to scale their ML workloads and produce the first comprehensive dataset of helmet impacts across multiple NFL seasons. They were able to reduce manual labor by 90% and the results beats human labelers in accuracy by 12%!
How to make digital technologies for the circular economy work for your business
In this post, we discuss the benefits of digital technology for the circular economy, and show how businesses can implement these technologies to get the most out of them for the wellbeing of everyone.
Streamlining distributed ML workflow orchestration using Covalent with AWS Batch
Complicated multi-step workflows can be challenging to deploy, especially when using a variety of high-compute resources. Covalent is an open-source orchestration tool that streamlines the deployment of distributed workloads on AWS resources. In this post, we outline key concepts in Covalent and develop a machine learning workflow for AWS Batch in just a handful of steps.
Introducing GPU health checks in AWS ParallelCluster 3.6
AWS ParallelCluster 3.6.0 can now detect GPU failures in HPC and AI/ML tasks. Health checks run at the start of Slurm jobs and if they fail, the job is requeued on another instance. This can increase reliability and prevent wasted spend.
Second generation EFA: improving HPC and ML application performance in the cloud
Since launch, EFA has seen continuous improvements in performance. In this post, we talk about our 2nd generation of EFA, which takes another step in improving Machine Learning and High Performance Computing in the Cloud.
Launch self-supervised training jobs in the cloud with AWS ParallelCluster
In this post we describe the process to launch large, self-supervised training jobs using AWS ParallelCluster and Facebook’s Vision Self-Supervised Learning (VISSL) library.