AWS HPC Blog

Protein language model training with NVIDIA BioNeMo framework on AWS ParallelCluster

This post was contributed by Marissa Powers and Ankur Srivastava from AWS, and Neel Patel from NVIDIA.

Proteins are large, complex biomolecules. They make up muscles, act as enzymes and antibodies, and carry out signaling throughout the body. Today, proteins are the therapeutic target for the majority of pharmaceutical drugs. Increasingly, scientists are using language models to better understand protein function, generate new protein sequences, and predict protein properties [1].

With the recent proliferation of new models and tools in this field, researchers are looking for ways to simplify the training, customization, and deployment of these generative AI models. And our high performance computing (HPC) customers are asking how to easily perform distributed training with these models on AWS.

In this post, we’ll demonstrate how to pre-train the ESM-1nv model with the NVIDIA BioNeMo framework using NVIDIA GPUs on AWS ParallelCluster, an open-source cluster management tool that makes it easy for you to deploy and manage HPC clusters on AWS. NVIDIA BioNeMo is a generative AI platform for drug discovery. It supports running commonly used models, including ESM-1, ESM-2, ProtT5nv, DNABert, MegaMolBART, DiffDock, EquiDock, and OpenFold. For the latest information on supported models, see the BioNeMo framework documentation.

Figure 1: Workflow for developing models with NVIDIA BioNeMo. The process is divided into phases for model development and customization, then fine-tuning and deployment.

This example deployment also leverages Amazon FSx for Lustre and the Elastic Fabric Adapter (EFA), which can both be provisioned and configured by ParallelCluster. With ParallelCluster, users can scale out distributed training jobs across hundreds or thousands of vCPUs. Code examples, including cluster configuration files, are available on GitHub.
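To give a sense of what this looks like in practice, the snippet below is a minimal sketch of how a ParallelCluster configuration declares an FSx for Lustre file system and enables EFA on a GPU queue. The mount directory, storage capacity, and deployment type shown here are illustrative values, not the ones in the template used later in this post.

SharedStorage:
  - MountDir: /fsx                  # shared scratch space for datasets and checkpoints
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 4800         # in GiB
      DeploymentType: SCRATCH_2
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute-gpu
      ComputeResources:
        - Name: distributed-ml
          InstanceType: p4de.24xlarge
          Efa:
            Enabled: true           # EFA for low-latency inter-node communication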

Walkthrough

For this example, we will (1) create an HPC cluster using AWS ParallelCluster, (2) configure the cluster with BioNeMo framework and download datasets, and (3) execute a pre-training job on the cluster.

Prerequisites

We assume you have access to an AWS account and are authenticated in that account. The configuration file used here is based on p4de.24xlarge instances, each powered by 8 NVIDIA A100 80 GB Tensor Core GPUs, and was tested in the us-east-1 Region. We have also tested it with p5.48xlarge instances, each powered by 8 NVIDIA H100 Tensor Core GPUs, which delivered better performance. Check your AWS service quotas to ensure you have sufficient access to these instance types.
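One quick way to check is with the AWS CLI. The EC2 On-Demand quotas for the P instance family are expressed in vCPUs, and the JMESPath filter below is just one way to narrow the output to the relevant entries:

aws service-quotas list-service-quotas \
   --service-code ec2 \
   --region us-east-1 \
   --query "Quotas[?contains(QuotaName, 'On-Demand P')].{Name:QuotaName,Value:Value}" \
   --output table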

ParallelCluster must be installed on a local node or instance. Follow the instructions in our documentation to install ParallelCluster in a virtual environment.
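If you haven't set this up before, the installation is only a few commands (Python 3 is required, and ParallelCluster 3 also needs Node.js for its CloudFormation tooling):

# Create and activate a virtual environment, then install the ParallelCluster CLI
python3 -m venv ~/pcluster-env
source ~/pcluster-env/bin/activate
pip install --upgrade pip
pip install aws-parallelcluster
pcluster version   # verify the installation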

Finally, to pull down the BioNeMo container, you will need an NVIDIA NGC API key. To set this up, follow the guidance under “NGC Setup” in the BioNeMo documentation.
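With the API key in hand, you can authenticate Docker against the NGC registry (nvcr.io). A minimal sketch, assuming Docker is installed and your key is stored in the NGC_API_KEY environment variable; note that the NGC username is always the literal string $oauthtoken:

# Log in to NVIDIA NGC so the BioNeMo container can be pulled
export NGC_API_KEY=<your-ngc-api-key>
echo "${NGC_API_KEY}" | docker login nvcr.io --username '$oauthtoken' --password-stdin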

Create a cluster

The AWS awsome-distributed-training GitHub repo provides configuration templates for running distributed training on multiple AWS services, including Amazon EKS, AWS Batch, and AWS ParallelCluster. In this example, we’ll create a cluster using one of the provided ParallelCluster configuration files. You can see an overview of the architecture in Figure 2.

Figure 2: In this distributed training architecture, ParallelCluster deploys a cluster consisting of a head node in a public subnet, a single queue of p4de.24xlarge instances in a private subnet, an FSx for Lustre file system, and a shared Amazon Elastic Block Store (EBS) volume. FSx for Lustre holds the input and output datasets, and the EBS volume stores job-specific submission scripts and log files. All resources are deployed within the user's own VPC.

To start, clone the awsome-distributed-training GitHub repository:

git clone https://github.com/aws-samples/awsome-distributed-training.git

Navigate to the templates directory in the locally cloned repository:

cd awsome-distributed-training/1.architectures/2.aws-parallelcluster

The configuration file we will use is distributed-training-p4de_postinstall_scripts.yaml. Open the file and update it to use an Ubuntu-based AMI:

Image:
  Os: ubuntu2004

ParallelCluster will deploy the head node and compute nodes in the subnets you specify in the YAML file. You can deploy all resources in the same subnet, or use separate subnets for the head node and compute nodes. Either way, we recommend deploying all resources in the same Availability Zone (AZ): ParallelCluster creates an FSx for Lustre file system for the cluster, which resides in a single AZ, and running the workload across multiple AZs would incur additional cross-AZ data transfer costs.

Update the head node subnet on line 13 of the YAML file, and the compute node subnet on line 50:

HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: PLACEHOLDER_PUBLIC_SUBNET
  Ssh:
    KeyName: PLACEHOLDER_SSH_KEY
...
  SlurmQueues:
    - Name: compute-gpu
      CapacityType: ONDEMAND
      Networking:
        SubnetIds:
          - PLACEHOLDER_PRIVATE_SUBNET
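If you are not sure which subnet IDs to use, you can list the subnets in your VPC along with their Availability Zones using the AWS CLI (replace the VPC ID placeholder with your own):

aws ec2 describe-subnets \
   --region us-east-1 \
   --filters "Name=vpc-id,Values=<your-vpc-id>" \
   --query "Subnets[].{Id:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock}" \
   --output table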

If you are using an On-Demand Capacity Reservation (ODCR), provide its ID in the YAML file. If not, comment out these two lines:

      CapacityReservationTarget:
        CapacityReservationId: PLACEHOLDER_CAPACITY_RESERVATION_ID
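If you do have a reservation but need to look up its ID, you can query it with the AWS CLI:

aws ec2 describe-capacity-reservations \
   --region us-east-1 \
   --query "CapacityReservations[].{Id:CapacityReservationId,Type:InstanceType,State:State}" \
   --output table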

Once the YAML file has been updated, use ParallelCluster to create the cluster:

pcluster create-cluster \
   --cluster-name bionemo-cluster \
   --cluster-configuration distributed-training-p4de_postinstall_scripts.yaml \
   --region us-east-1 \
   --dryrun false

Cluster creation will take 20-30 minutes. You can monitor the status of cluster creation with:

watch pcluster describe-cluster \
   --cluster-name bionemo-cluster \
   --region us-east-1 \
   --query clusterStatus

The cluster will be deployed as an AWS CloudFormation stack. You can also monitor resource creation from the CloudFormation console or by querying CloudFormation with the AWS CLI.
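For example, you can query the stack status directly; the stack name matches the cluster name passed to pcluster create-cluster:

aws cloudformation describe-stacks \
   --stack-name bionemo-cluster \
   --region us-east-1 \
   --query "Stacks[0].StackStatus" \
   --output text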

Configure cluster and input datasets

Once the cluster status is CREATE_COMPLETE, follow the guidance in the GitHub repo to finish setup and model pre-training. At a high level, these steps walk through how to (1) pull down the BioNeMo framework container; (2) build an AWS-optimized image; (3) download and pre-process the UniRef50 dataset; and (4) run the ESM-1nv pre-training job.
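As an illustration of the first of those steps, pulling the container and converting it for use with Slurm typically looks like the sketch below, run after logging in to nvcr.io. The image path and tag shown here are placeholders; use the exact ones given in the GitHub walkthrough, and note that this sketch assumes Docker and enroot are available on the node you run it from.

# Pull the BioNeMo framework container from NGC (image path and tag are placeholders)
docker pull nvcr.io/nvidia/clara/bionemo-framework:<tag>

# Convert it to a squash file with enroot so Slurm jobs can launch it through Pyxis
enroot import -o /fsx/bionemo.sqsh dockerd://nvcr.io/nvidia/clara/bionemo-framework:<tag>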

Once you submit the pre-training job, you can monitor progress by running tail on the output log file:

tail -f /apps/slurm-esm1nv-<job-id>.out
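Since the cluster uses Slurm as its scheduler, the usual Slurm commands are also handy for checking job and node status:

squeue                      # list running and queued jobs
sinfo                       # show partition and node state
scontrol show job <job-id>  # detailed information about a specific job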

For more detailed monitoring of the cluster, including memory, networking, and storage usage on both the head node and compute nodes, consider creating a Grafana dashboard. A detailed guide for creating a dashboard for ParallelCluster clusters is available on GitHub.

Conclusion

In this post, we demonstrated how to pre-train ESM-1nv with the NVIDIA BioNeMo framework and NVIDIA GPUs on AWS ParallelCluster. For information about other models supported by the NVIDIA BioNeMo framework, see the BioNeMo framework documentation. For guides on deploying other distributed training jobs on AWS, check out the additional test cases in the awsome-distributed-training repository on GitHub.

For an alternative way to deploy the BioNeMo framework on AWS, check out our guide for Amazon SageMaker.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

Reference

[1] Ruffolo, J.A., Madani, A. Designing proteins with language models. Nat Biotechnol 42, 200–202 (2024). https://doi.org/10.1038/s41587-024-02123-4

Marissa Powers

Marissa Powers is a specialist solutions architect at AWS focused on high performance computing and life sciences. She has a PhD in computational neuroscience and enjoys working with researchers and scientists to accelerate their drug discovery workloads. She lives in Boston with her family and is a big fan of winter sports and being outdoors.

Ankur Srivastava

Ankur Srivastava is a Sr. Solutions Architect on the ML Frameworks team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and post-doctoral research at the Massachusetts Institute of Technology.

Neel Patel

Neel Patel is a drug discovery scientist at NVIDIA, focusing on cheminformatics and computational structural biology. Before joining NVIDIA, Patel was a computational chemist at Takeda Pharmaceuticals. He holds a Ph.D. from the University of Southern California. He lives in San Diego with his family and enjoys hiking and traveling.