AWS HPC Blog

Improve HPC workloads on AWS for environmental sustainability

HPC workloads are used across industries from finance and engineering to genomics and chemistry – all to enable opportunities for innovation and growth. Because the compute is so intensive, these systems can consume large amounts of energy and generate significant carbon emissions. Many businesses are looking to reduce their carbon footprint to meet their Paris Agreement commitments, and on-premises HPC clusters constitute a noticeable portion of their total data center impact. By migrating its HPC workloads to AWS, Baker Hughes was able to reduce its carbon footprint by 99%, setting an example for the community.

If you want to achieve a similar result, it’s critical to optimize your cluster for resource efficiency and effectiveness. HPC architectures are built on five main pillars: compute, storage, networking, visualization and orchestration.

In this post, we’ll focus on each one of these pillars, and discuss tips and best practices that can help you optimize resource usage and the environmental impact of your workloads, while still meeting your productivity goals.

To demonstrate these ideas, we’ll assume the use case of an automotive customer who needs to run computer-aided engineering (CAE). These types of jobs need large volumes of compute, so they’re representative of most simulations you’re likely to be doing in your environment, too.

Compute

Choose the right instance type for your HPC workload

In an on-premises cluster, the computing nodes are usually homogeneous – they have the same CPU type and the same quantity of memory per core. However, HPC workloads don’t all have the same requirements. For example, structural analysis can benefit from a low-latency NVMe scratch disk, computational fluid dynamics (CFD) jobs usually span a greater number of cores, and finite element analysis (FEA) simulations need more memory per core than other types of simulations.

To reduce your carbon footprint, you need to optimize your consumption of computing resources, so it’s important to match your workload to the right compute instance types.

On AWS, there are more than 750 Amazon Elastic Compute Cloud (Amazon EC2) instance types to choose from, each with different hardware characteristics. Recently, we launched a new family of instances dedicated to HPC workloads.

For our CAE use case, you can use the new Amazon EC2 Hpc7a instances. We’ve seen them perform very well for workloads that can be accelerated by increasing the number of compute cores. You can read about some of our benchmark tests in another post.
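If you want to confirm the published hardware characteristics of a candidate instance type before committing to it, you can query them with the AWS CLI. Here’s a minimal sketch – the Region is just an example, since an instance type only appears in Regions where it’s offered:

#!/bin/sh
# Show the published vCPU count, memory, and network performance for an instance type.
# The Region below is an example – use a Region where the instance type is offered.
aws ec2 describe-instance-types \
  --instance-types hpc7a.96xlarge \
  --region eu-north-1 \
  --query 'InstanceTypes[0].{vCPUs:VCpuInfo.DefaultVCpus,MemoryMiB:MemoryInfo.SizeInMiB,NetworkPerformance:NetworkInfo.NetworkPerformance}' \
  --output table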

Use AWS Graviton instances if you can

AWS Graviton is a custom processor designed by AWS, based on the Arm64 architecture, that provides a balance of efficiency, performance, and cost. This makes it a good fit for lowering the environmental impact of many HPC workloads. AWS Graviton-based Amazon EC2 instances use up to 60% less energy than comparable EC2 instances for the same performance. If your HPC application can be compiled for Arm64, you can use Graviton instances to enable greater energy efficiency for your compute-intensive workloads.

Hpc7g instances are powered by AWS Graviton3E processors, with up to 35% higher vector-instruction processing performance than Graviton3. They are designed to give you the best price/performance for tightly coupled, compute-intensive HPC and distributed computing workloads, and deliver 200 Gbps of dedicated network bandwidth that’s optimized for traffic between instances in the same subnet.
For CAE, customers can take advantage of open-source CFD applications, like OpenFOAM, that can be compiled for Arm architecture processors and are well suited to the Hpc7g instance type. ISVs like Siemens, Cadence, and Ansys now make versions of their solvers available for Graviton in the same way they do for x86.
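If you’re shortlisting Graviton options, you can ask the EC2 API which Arm64 instance types a Region offers. A minimal sketch, assuming the Stockholm Region purely as an example:

#!/bin/sh
# List the Arm64 (Graviton-based) instance types offered in a Region.
# The Region below is an example – substitute your own shortlisted Region(s).
aws ec2 describe-instance-types \
  --region eu-north-1 \
  --filters Name=processor-info.supported-architecture,Values=arm64 \
  --query 'sort(InstanceTypes[].InstanceType)' \
  --output text | tr '\t' '\n'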

The table below summarizes some of the key HPC instance types with their characteristics, and provides a recommendation of the target workload.

Table 1 – Comparison of Amazon EC2 HPC specific instance types showing example workloads and attributes

Choose the right AWS Region for your HPC workload

The AWS Well-Architected Framework helps customers build secure, high-performing, resilient, and efficient architectures, and is built around six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. The framework provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs.

The SUS01-BP01 best practice in the AWS Well-Architected Sustainability Pillar recommends that you choose the AWS Regions for your workload based on both business requirements and sustainability goals.
As explained in another blog post, this process includes two key steps:

  • Assess and shortlist potential Regions for your workload based on your business requirements.
  • Choose Regions near Amazon renewable energy projects and Region(s) where the grid has a lower published carbon intensity.

As explained in the previous step, Hpc7a instances can accelerate the sample CAE workload and help you achieve your business goals.

You can use this simple script to check where this instance type is available – you’ll find, for example, that it’s offered in the Stockholm Region (eu-north-1):

#!/bin/sh
instance_type=hpc7a.96xlarge

# Get a list of all AWS regions
regions=$(aws ec2 describe-regions --all-regions --query 'Regions[].RegionName' --output text)

echo "AWS Regions where $instance_type is available:"

# Loop through each region and check if the instance type is available
for region in $regions; do
  available=$(aws ec2 describe-instance-type-offerings --filters Name=instance-type,Values=$instance_type --region $region --query 'InstanceTypeOfferings[].InstanceType' --output text 2>/dev/null)

  if [ "$available" = "$instance_type" ]; then
    echo "- $region"
  fi
done

Moreover, the electricity consumed in the Stockholm Region is attributable to 100% renewable energy as listed on the Amazon sustainability website. So, this Region is a good candidate for running your CAE workloads, and also achieving your sustainability goals.

Use a mix of EC2 purchase options

HPC workloads usually have different resource requirements and business priorities. You can use a combination of purchase options to address your specific business needs, balancing instance flexibility, scalability, and efficiency.

  • Use Compute Savings Plans for predictable, steady-state workloads; they give you flexibility if your needs (like target Region, instance family, or instance type) change.
  • Use On-Demand Instances for new or unpredictable workloads.
  • Use Spot Instances to supplement the other options for applications that are fault tolerant and flexible.

Our example use case may require periods of intense activity to meet project milestones or client deadlines, where the elasticity of the cloud facilitates access to large amounts of On-Demand compute resources to meet those tight deadlines. AWS HPC orchestration tools like AWS Batch and AWS ParallelCluster both support Spot, On-Demand, and Savings Plans resources.
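As a quick way to gauge whether Spot is attractive for the burst portion of a workload, you can look at recent Spot pricing for a candidate instance type and Region. A minimal sketch – the instance type and Region here are illustrative assumptions, not a recommendation:

#!/bin/sh
# Show the most recent Spot price per Availability Zone for a candidate instance type.
# Instance type and Region are placeholders.
aws ec2 describe-spot-price-history \
  --instance-types c6i.32xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --region eu-north-1 \
  --query 'SpotPriceHistory[].{AZ:AvailabilityZone,Price:SpotPrice}' \
  --output table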

Storage

AWS provides several types of storage and file system technologies that can be used for HPC workloads. Similar to compute, optimizing your storage choices can help you to accelerate your workload and help to reduce your carbon emissions.

Every workload has a different data strategy, but we can split it into different stages:

On-premises storage

Amazon’s goal is to power our operations with 100% renewable energy by 2025 – five years ahead of our original 2030 target – so migrating your entire portfolio of workloads to AWS is usually a great option to reduce your carbon footprint.

In some cases this isn’t possible, because your data is generated by on-site instruments or labs. You then need an efficient mechanism to move data to AWS for processing in the cloud. Some of our managed file system offerings, such as Amazon File Cache, allow you to link an on-premises file system to the cloud. By doing that, you’ll avoid unnecessarily replicating your data in the cloud, and the results of your simulations will also be accessible on-premises.

Local Disk

Use EBS volumes to store your data on the computing nodes. If your applications need access to high-speed, low-latency local storage for scratch files, you can use local NVMe-based SSD block storage physically connected to the host server. Many Amazon EC2 instances – like the Hpc6id – have multiple NVMe disks to accelerate the performance of the most I/O-demanding workloads. Keep in mind that these local disks are ephemeral, and their content is automatically erased when you stop the instances. Identify which files you need to keep when a job completes, and move them to a persistent storage area, like a shared POSIX file system or Amazon S3.
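To illustrate the scratch-then-persist pattern, here’s a minimal sketch that formats and mounts an instance store NVMe volume as a scratch area, then copies the results you want to keep to Amazon S3 when the job finishes. The device name, mount point, and bucket name are placeholders – check the actual NVMe device names on your instance.

#!/bin/sh
# Format and mount a local NVMe instance store volume as scratch space.
# Device name, mount point, and S3 bucket are placeholders.
sudo mkfs.xfs /dev/nvme1n1
sudo mkdir -p /scratch
sudo mount /dev/nvme1n1 /scratch

# ... run the solver, writing temporary files to /scratch ...

# Persist only the results you need to keep before the instance is stopped.
aws s3 cp /scratch/results/ s3://my-hpc-results-bucket/job-1234/ --recursive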

Shared POSIX file system

If your applications need to access a shared file system, you can use one of the AWS managed file systems. For example, Amazon FSx for Lustre is a fully-managed service that provides high-performance storage for compute workloads. Powered by Lustre, the world’s most popular high-performance file system, FSx for Lustre offers shared storage with sub-millisecond latencies, up to terabytes per second of throughput, and millions of IOPS.

Long term storage

Once the jobs are completed, you can migrate your data to long-term storage in Amazon S3. This will help you to reduce the amount of data in the shared file system. FSx for Lustre also integrates with Amazon S3 via the data repository mechanism, which allows seamless access to objects in Amazon S3 that can be lazy loaded into the high-performance file system layer. This approach delivers the required access semantics and performance for scalable HPC applications when they’re needed, while also providing the usual capacity, cost, data protection, lifecycle, and sustainability benefits of Amazon S3.
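To illustrate the S3 integration, here’s a minimal sketch that creates a scratch FSx for Lustre file system linked to an S3 bucket as its data repository, so objects can be lazy loaded into the file system and results exported back to S3. The bucket name, subnet ID, and capacity are placeholder assumptions.

#!/bin/sh
# Create a scratch FSx for Lustre file system linked to an S3 data repository.
# Bucket, subnet, and storage capacity are placeholders – size them for your workload.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids subnet-0123456789abcdef0 \
  --lustre-configuration "DeploymentType=SCRATCH_2,ImportPath=s3://my-hpc-data-bucket,ExportPath=s3://my-hpc-data-bucket/results,AutoImportPolicy=NEW_CHANGED"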

Use life cycle policies to move and delete data

HPC workloads have datasets with different retention and access requirements. For example, your HPC application may need frequent access to some datasets for a limited period of time. After that, those datasets can be archived or deleted.

To efficiently manage your HPC datasets, you can configure lifecycle policies – rules that define how datasets are transitioned, archived, or deleted over time. You can set up automated lifecycle policies for Amazon S3, Amazon Elastic Block Store (Amazon EBS), and Amazon EFS.

Our example CAE application is well suited to the FSx for Lustre file system, since it allows high-performance concurrent access from multiple compute instances. You can use the S3 data repository settings to automate migration of the data to Amazon S3, and then use S3 lifecycle policies to move the data between storage classes like S3 Standard-IA and S3 Glacier Deep Archive.
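As a sketch of such a policy (the bucket name, prefix, and day thresholds are assumptions to adapt to your own retention requirements), the following moves results to S3 Standard-IA after 30 days and to S3 Glacier Deep Archive after 180 days:

#!/bin/sh
# Apply a lifecycle policy that tiers older results into colder storage classes.
# Bucket name, prefix, and day thresholds are placeholders.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-cae-results",
      "Status": "Enabled",
      "Filter": { "Prefix": "results/" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-hpc-results-bucket \
  --lifecycle-configuration file://lifecycle.json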

Networking

For tightly coupled workloads, it’s important to remove all network bottlenecks to guarantee the best utilization of your computing resources. These applications are sensitive to network latency, so we recommend using Cluster Placement Groups (CPG) and Elastic Fabric Adapter (EFA). Deploying your compute nodes in the same CPG will allow for reliable low-latency communication between the instances, and will help your tightly-coupled application to scale as desired.
Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run HPC applications requiring high levels of inter-instance communication, like computational fluid dynamics, weather modelling, and reservoir simulation, at scale on AWS.
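For illustration, the sketch below creates a cluster placement group and confirms that a candidate instance type supports EFA. The group name is a placeholder, and the Region is just an example:

#!/bin/sh
# Create a cluster placement group for low-latency, tightly coupled jobs.
# The group name is a placeholder; launch your compute instances into this group.
aws ec2 create-placement-group \
  --group-name cae-cpg \
  --strategy cluster

# Confirm that the chosen instance type supports EFA (run in a Region that offers it).
aws ec2 describe-instance-types \
  --instance-types hpc7a.96xlarge \
  --region eu-north-1 \
  --query 'InstanceTypes[0].NetworkInfo.EfaSupported' \
  --output text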

Coming back to our example, most of the HPC applications used in the automotive industry implement distributed memory parallelism using the Message Passing Interface (MPI), allowing a single simulation to employ parallel processing using the cores and memory of multiple instances to increase simulation speed. MPI applications perform best using instances which feature EFA because it reduces latency and increases throughput for MPI parallel communications.

Orchestration

If you want to reduce the environmental impact of HPC workloads, it’s important to right size your compute resources and deploy only when needed. There is usually no need to have all your compute always up and running when there are no jobs to run.

On AWS, you can start and stop the computing nodes based on your users’ needs: they submit their jobs, and the AWS HPC infrastructure scales capacity up (or down) to provide the right amount of capacity exactly when it’s needed.

  • AWS Batch is a fully managed batch computing service that plans, schedules, and runs your containerized HPC or machine learning workloads across the full range of AWS compute offerings, such as Amazon ECS, Amazon EKS, and AWS Fargate, using Spot or On-Demand instances. If you select a capacity-optimized Spot allocation strategy, AWS Batch will automatically select Spot Instances that are large enough to meet the requirements of your jobs and are less likely to be interrupted (see the sketch after this list). Using Spot Instances is a great way to run your compute-intensive workloads on our unused capacity.
  • AWS ParallelCluster is an open-source cluster management tool that makes it easy for you to deploy and manage HPC clusters on AWS. ParallelCluster uses a simple text file to model and provision all the resources needed for your HPC applications in an automated and secure manner. It supports multiple instance types and job submission queues using Slurm.
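As a sketch of the capacity-optimized Spot allocation strategy mentioned in the list above, the following creates a managed AWS Batch compute environment. Every ID, ARN, vCPU limit, and instance family here is a placeholder assumption, not a working or recommended value:

#!/bin/sh
# Create a managed AWS Batch compute environment using capacity-optimized Spot.
# Subnet, security group, instance profile, and sizing values are placeholders.
aws batch create-compute-environment \
  --compute-environment-name cae-spot-ce \
  --type MANAGED \
  --state ENABLED \
  --compute-resources '{
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 2048,
    "instanceTypes": ["c6i", "c6a"],
    "subnets": ["subnet-0123456789abcdef0"],
    "securityGroupIds": ["sg-0123456789abcdef0"],
    "instanceRole": "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole"
  }'

Setting minvCpus to 0 lets the compute environment scale down to nothing when no jobs are queued, so you only consume resources while work is actually running.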

For example, automotive customers tend to deploy their HPC infrastructure using AWS ParallelCluster because it can configure EFA automatically and mount an FSx for Lustre file system for the high-performance scratch area.
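To make this concrete, here’s a minimal ParallelCluster 3 configuration sketch for such a cluster: a Slurm queue of Hpc7a instances with EFA and a cluster placement group enabled, plus an FSx for Lustre scratch file system. The subnet IDs, key pair name, instance counts, and storage size are placeholder assumptions, not a tested configuration.

#!/bin/sh
# Write a minimal ParallelCluster 3 configuration and create the cluster.
# Subnet IDs, key pair name, instance counts, and storage size are placeholders.
cat > cluster-config.yaml <<'EOF'
Region: eu-north-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c6i.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0
  Ssh:
    KeyName: my-keypair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: cae
      ComputeResources:
        - Name: hpc7a
          InstanceType: hpc7a.96xlarge
          MinCount: 0
          MaxCount: 16
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
        PlacementGroup:
          Enabled: true
SharedStorage:
  - MountDir: /fsx
    Name: fsx-scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
EOF

pcluster create-cluster \
  --cluster-name cae-cluster \
  --cluster-configuration cluster-config.yaml

With MinCount set to 0, compute nodes only run while jobs are queued, which keeps idle resource consumption – and its associated footprint – to a minimum.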

Remote visualization

If you need an interactive graphical application to generate the input files for your jobs, it’s more convenient to also run the pre-processing GUI on AWS. Customers use remote visualization technologies such as NICE DCV to stream the graphics from EC2 nodes back to the end-user laptop. By using a remote visualization technology, you remove the need to copy large datasets to and from the cloud.

For our automotive use case, NICE DCV is an excellent choice for the graphics intensive pre-processing and post-processing stages of the HPC workflows. NICE DCV allows you to use graphics instances with powerful GPUs and direct access to the same data used by the simulation stages, allowing high performance and interactive model setup and visualization of results.

Evaluating sustainability improvements for your workloads

The best way to evaluate success when optimizing HPC workloads for sustainability is to use proxy measures and unit of work KPIs.

The aggregate cost of your computational resources can be a good “rule of thumb” proxy for their energy consumption. Generally, compute resources (EC2 instance types and sizes) with more processing power use commensurately more energy and typically have a higher price.

To get a measure of where you currently stand, you could analyze the total volume of simulation or modelling work that you accomplished within, say, a typical one-month period, and derive a metric for the average cost per job. Of course, different applications and workloads can have quite different computational resource requirements, so you might be better served deriving an average cost per job for each job type. Your challenge is then to reduce this average cost per job by trying different EC2 instance types or sizes that can deliver the required performance at lower cost.
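As a rough illustration of this proxy metric (the date range is an example, and the job count assumes a Slurm scheduler with accounting enabled – adjust for your own scheduler), you could divide a month of EC2 spend by the number of completed jobs:

#!/bin/sh
# Rough cost-per-job proxy: one month of EC2 spend divided by completed job count.
# The date range is an example; sacct assumes Slurm accounting is enabled.
ec2_cost=$(aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-06-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Compute"]}}' \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text)

job_count=$(sacct -X --noheader --state=COMPLETED \
  --starttime 2024-05-01 --endtime 2024-06-01 --format=JobID | wc -l)

echo "Average cost per job: $(echo "$ec2_cost / $job_count" | bc -l)"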

For each job type, it’s also important to define a service level agreement (SLA) represented by the maximum allowable run time (cut-off time) to maintain business operations or staff productivity objectives. This defines the practical limit, beyond which further parallelism, or performance optimization doesn’t help.

For the type and quantity of computing resources that you do select, it’s also important to pay attention to resource utilization while running the workload. For compute-intensive HPC workloads, you shouldn’t have any idle instances, and all CPU cores should generally be running at close to 100%, unless this is explained by phases of parallel communication or storage I/O. Remember that if network transfers or I/O time are significant, it may be beneficial to select EC2 instances with higher-speed networking, or faster local or shared storage options.
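For example, you can pull the average CPU utilization of a compute instance from Amazon CloudWatch over a job’s run window and look for sustained periods well below full load. The instance ID and time window below are placeholders:

#!/bin/sh
# Check average CPU utilization for a compute instance over a job's run window.
# Instance ID and time window are placeholders.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-05-01T08:00:00Z \
  --end-time 2024-05-01T20:00:00Z \
  --period 300 \
  --statistics Average \
  --query 'sort_by(Datapoints,&Timestamp)[].{Time:Timestamp,AvgCPU:Average}' \
  --output table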

To better understand how to use proxy measures to optimize workloads for efficiency, you can read Sustainability Well-Architected Lab on Turning the Cost and Usage Report into Efficiency Reports.

Conclusion

Reducing the environmental impact of HPC workloads is a journey. Moving your workloads to AWS can help you reduce your carbon footprint quickly. In this post, we provided guidance and recommendations on how to improve your HPC workloads on AWS without sacrificing your business outcomes.

To learn more, check out the Sustainability Pillar of the AWS Well-Architected Framework and other blog posts on architecting for sustainability. For more architecture content, refer to the AWS Architecture Center for reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more.

Sam Mokhtari

Dr. Sam Mokhtari leads the sustainability pillar of the AWS Well-Architected Framework. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field.

Francesco Ruffino

Francesco Ruffino is a Principal HPC Specialist Solutions Architect at AWS, which he joined in 2016. He has been working in HPC for 20 years in roles ranging from professional consultant to pre-sales engineer and product manager. Prior to joining AWS, he worked for NICE, helping grow the company from a small Italian firm into a global organization that was acquired by Amazon Web Services (AWS) in 2016. Francesco holds a Master’s degree in Computer Engineering from “Università La Sapienza” in Rome, Italy.

Todd Churchward

Todd Churchward is an HPC Specialist Solutions Architect with AWS, who enjoys automating and optimizing the performance of everything he comes into contact with.