AWS HPC Blog
Optimize Protein Folding Costs with OpenFold on AWS Batch
This post was contributed by Sachin Kadyan, a leading developer of OpenFold, and Brian Loyal, Sr Solutions Architect for AI/ML at AWS.
Introduction
Knowing the physical structure of proteins is an important part of the drug discovery process. Machine learning (ML) algorithms like AlphaFold v2.0 significantly reduce the cost and time needed to generate usable protein structures. These projects have also inspired development of AI-driven workflows for de novo protein design and protein-ligand interaction analysis.
Researchers have used AlphaFold to publish over 200 million protein structures. However, newer algorithms may provide cost or accuracy improvements. One example is OpenFold, a fully open-source alternative to AlphaFold, optimized to run on widely available GPUs.
In this post, we build on prior work to describe how to orchestrate protein folding jobs on AWS Batch. We also compare the performance of OpenFold and AlphaFold on a set of public targets. Finally, we discuss how to optimize your protein folding costs.
OpenFold
Scientists need accessible, flexible, and cost-effective tools for analyzing protein structures. OpenFold is a protein folding model with several advantages for R&D teams. First, OpenFold uses PyTorch, a powerful ML framework popular with deep learning researchers. Second, the OpenFold development team at Columbia University released the inference code, training code, and model weights under a permissive open-source license. The development team also shared 4.5 million protein sequences and multiple sequence alignments (MSAs) used to train OpenFold via the Registry of Open Data on AWS (RODA). This means that anyone can use and improve the model without restrictions. Finally, OpenFold is optimized to run efficiently on common graphics processing units (GPUs), like those found in accelerated Amazon Elastic Compute Cloud (EC2) instance types.
OpenFold uses methods like low-memory and in-place attention to optimize memory use during inference. This allows the algorithm to predict structures with up to 4,600 residues on an A100 GPU with 40 GiB VRAM. It also uses FlashAttention to speed up model training and is compatible with alternative MSA-generation tools like MMseqs2.
When predicting a structure, OpenFold first builds a numerical representation of the protein sequence by passing it through multiple layers of an MSA Transformer Module, similar to the AlphaFold Evoformer. This maximizes information flow between the MSA representation (which captures evolutionary information) and the pair representation (which captures interactions between amino acids). OpenFold refines these representations by recycling them through the module multiple times. After a preset number of cycles, an additional Structure Module builds and expands the 3D structure by applying a special form of attention on frames centered on individual residues.
AWS Batch Architecture for Protein Folding and Design
The AWS Batch Architecture for Protein Folding and Design is an extendable solution for protein structure analysis on AWS. This solution allows customers to predict the unknown structure of an amino acid sequence with OpenFold and AlphaFold simultaneously. We expect to add support for additional algorithms in the future. Customers can provision the infrastructure in their own AWS accounts using AWS CloudFormation in less than 30 minutes. They can also automatically download data dependencies like UniRef90, BFD, and PDB70 from public repositories onto a secure Amazon FSx for Lustre file system.
An AWS CloudFormation template provisions the necessary network, file system, container, and computing resources. If requested, it will also download the necessary public reference data. Once the installation is complete, users can submit protein analysis jobs to one or more Batch queues using the included BatchFold Python library. Customers can use the provided code to submit folding jobs from other AWS services like Amazon SageMaker Studio or AWS Step Functions.
Monomer Folding with OpenFold
To demonstrate monomer folding with OpenFold, we’ll start by creating a target for 7FCC, an IL-1 binding domain.
Next, we define and submit a JackHMMER job to generate the MSA files needed for accurate AlphaFold or OpenFold analysis. In this case, we specify that the job run in an environment with 16 vCPUs and 31 GiB of system memory. We then submit the job to the Graviton-Spot job queue to select the best instance type from the m6g, r6g, and c6g families.
Finally, we define and submit an OpenFold job to predict the structure of our target. We specify 1 GPU and submit the job to the G4dn job queue. By adding a depends_on value, the OpenFold job will wait for the MSA job to finish successfully before starting.
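Jobs reach AWS Batch through its SubmitJob API, so the two submissions described above can be pictured as a pair of request payloads. The following is a minimal sketch in plain Python, not the BatchFold implementation. The helper name, queue names, job definition names, placeholder job ID, and the vCPU and memory values for the OpenFold job are all assumptions made for illustration. Only the 16 vCPU, 31 GiB, and 1 GPU figures come from the text.

```python
GIB_IN_MIB = 1024  # AWS Batch expects the MEMORY requirement in MiB


def build_submit_request(job_name, job_queue, job_definition,
                         vcpus, memory_gib, gpus=0, depends_on_job_ids=()):
    """Build keyword arguments shaped like an AWS Batch SubmitJob request.

    Hypothetical helper for illustration; it does not call AWS.
    """
    requirements = [
        {"type": "VCPU", "value": str(vcpus)},
        {"type": "MEMORY", "value": str(memory_gib * GIB_IN_MIB)},
    ]
    if gpus:
        requirements.append({"type": "GPU", "value": str(gpus)})
    request = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {"resourceRequirements": requirements},
    }
    if depends_on_job_ids:
        # The dependent job stays pending until these jobs succeed.
        request["dependsOn"] = [{"jobId": job_id} for job_id in depends_on_job_ids]
    return request


# MSA generation: 16 vCPUs and 31 GiB, as described in the text.
# Queue and job definition names are placeholders.
msa_request = build_submit_request(
    job_name="7FCC-jackhmmer",
    job_queue="GravitonSpotJobQueue",
    job_definition="JackhmmerJobDefinition",
    vcpus=16,
    memory_gib=31,
)

# In a real run, the job ID comes from the SubmitJob response.
msa_job_id = "example-msa-job-id"

# Structure prediction: 1 GPU, as described in the text.
# The vCPU and memory values are assumptions sized to fit a g4dn.xlarge.
fold_request = build_submit_request(
    job_name="7FCC-openfold",
    job_queue="G4dnJobQueue",
    job_definition="OpenFoldJobDefinition",
    vcpus=4,
    memory_gib=15,
    gpus=1,
    depends_on_job_ids=[msa_job_id],
)
```

With boto3 installed and credentials configured, each dictionary could be passed as keyword arguments to the Batch client's submit_job method. The BatchFold library handles these details for you, so this sketch is only meant to show where the resource requests and the dependency end up.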
Once the jobs have completed, we download the results from Amazon Simple Storage Service (Amazon S3) and visualize the structure using py3Dmol.
Comparing OpenFold and AlphaFold 2 for Structure Prediction
To compare the performance of OpenFold and AlphaFold on AWS Batch, we examined 32 monomer proteins submitted to the CAMEO protein target dataset in July and August 2022. We pre-computed MSAs for each target using JackHMMER against the full BFD database. Then, we used both algorithms to predict the structure of each target using the default parameters. The folding jobs ran on g4dn.xlarge instances with 4 vCPUs, 16 GiB of memory, and a single T4 GPU.
We compared the predictions against the experimentally determined structures deposited in the RCSB Protein Data Bank and calculated the prediction accuracy (GDT_TS) using TM-score. The mean GDT_TS difference between the two models was less than 1%. The scores for 19 of the targets were greater than 0.9, a level of accuracy comparable to experimental methods.
Because AWS Batch will automatically terminate unused instances, the total cost of each job strongly depends on the run duration. On average, OpenFold generated predictions 90% faster than AlphaFold. There are several reasons for this:
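For readers unfamiliar with the metric, GDT_TS is the average of four fractions: the share of residues whose predicted C-alpha atom lies within 1, 2, 4, and 8 angstroms of its experimental position. The sketch below is a simplified illustration, not the TM-score program's algorithm. It assumes the per-residue distances come from a single fixed superposition, whereas the full calculation searches for the best superposition at each cutoff. The function name and the example distances are made up for illustration.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS on a 0-1 scale from per-residue C-alpha distances.

    Assumes the distances (in angstroms) were measured after one fixed
    superposition of the predicted and experimental structures.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return sum(fractions) / len(cutoffs)


# Illustrative distances for an 8-residue toy example (not real data).
example_distances = [0.4, 0.8, 1.5, 1.9, 3.0, 3.5, 6.0, 9.5]
score = gdt_ts(example_distances)
print(f"GDT_TS = {score:.3f}")
```

Because the sketch skips the superposition search, it can only underestimate the score that a full GDT_TS calculation would report for the same pair of structures.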
- By default, AlphaFold runs 5 models in series to generate its final prediction.
- The AlphaFold algorithm includes a compilation step at the start of each job to optimize the pretrained network. OpenFold requires no such pre-compilation for g4dn instance types.
- The OpenFold container image defined in the AWS Batch Architecture for Protein Folding and Design is optimized to run on AWS G- and P-family EC2 instance types.
Based on this data, we recommend running OpenFold on the G4dn job queue for predicting the structure of single-chain proteins with fewer than 1,300 residues.
Analyzing Large Proteins with Amazon EC2 G5 Instances
We recommend the following steps for protein targets with more than 1,300 residues:
- Submit your OpenFold jobs to the optional G5 Job Queue. This increases the available VRAM from 16 to 24 GiB. Note that G5 instance types are not available in all AWS regions.
- Set the long_sequence_inference (LSI) flag to True in your OpenFold job. This reduces GPU memory usage at the cost of increased run time. Note that this may require additional system (CPU) memory to prevent out-of-memory errors.
OpenFold generated predictions for proteins as long as 1,838 residues when running on a g4dn.xlarge instance with LSI activated. This length increased to 2,194 residues on a g5.xlarge instance.
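The recommendations above can be folded into a small helper that picks a queue and sets the LSI flag based on sequence length. This is a sketch under stated assumptions: the function name, the queue names, and the returned dictionary layout are hypothetical, and treating a sequence of exactly 1,300 residues as "large" is our own conservative choice. Only the 1,300 and 2,194 residue figures and the long_sequence_inference flag come from the text.

```python
G4DN_RECOMMENDED_MAX = 1300  # residues, from the recommendation in this post
G5_LONGEST_REPORTED = 2194   # residues, longest g5.xlarge prediction reported above


def plan_openfold_job(sequence_length):
    """Suggest a job queue and LSI setting for a sequence length.

    Queue names are placeholders; substitute the names from your own stack.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if sequence_length < G4DN_RECOMMENDED_MAX:
        return {
            "job_queue": "G4dnJobQueue",
            "long_sequence_inference": False,
            "beyond_reported_range": False,
        }
    return {
        "job_queue": "G5JobQueue",
        "long_sequence_inference": True,
        # Flag targets longer than anything reported in this post.
        "beyond_reported_range": sequence_length > G5_LONGEST_REPORTED,
    }


small_plan = plan_openfold_job(350)
large_plan = plan_openfold_job(1900)
print(small_plan)
print(large_plan)
```

The beyond_reported_range flag does not mean the job will fail. It only marks targets longer than the longest prediction reported here, where you may want to test memory headroom before running at scale.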
Reducing the Cost of MSA Generation with AWS Graviton2 and Spot Instance Types
The longest step of the end-to-end prediction pipeline for both OpenFold and AlphaFold is MSA generation. In our previous post, we described how to save costs by using AWS Batch to generate MSAs on CPUs. However, it is possible to reduce costs even further using Graviton2-based and Spot instances.
We’ve optimized the MSA-generation container included in the AWS Batch Architecture to run on Graviton2 processors by default. This results in up to 40% better price performance over comparable current-generation x86-based instances. In addition to this, users can choose to submit their JackHMMER MSA jobs to the Graviton-Spot Job Queue for additional cost savings of up to 90%.
In the case of our 7FCC example, Batch provisioned an r6g.4xlarge instance (16 vCPU, 128 GiB memory) for the JackHMMER job, which ran for 98 minutes. As of this writing, the On-Demand pricing for this instance size in the US East (N. Virginia) Region is $0.81 per hour. However, the Spot price was only $0.30 per hour, a savings of roughly 62%. Note that the actual instance type and Spot pricing will vary based on regional availability.
The most expensive option is to run both the MSA generation and structure prediction jobs on a single g4dn EC2 instance. However, customers can reduce the run cost by (1) running a separate MSA generation job on a non-accelerated instance type, (2) using OpenFold instead of AlphaFold for the structure prediction, (3) running the MSA generation job on a Graviton2-based instance type, and (4) using Spot instances for up to 90% savings.
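To put the 7FCC numbers in per-job terms, the short calculation below multiplies each hourly price by the 98-minute run time. It is a back-of-the-envelope sketch that assumes cost scales linearly with run time and that the Spot price stays constant for the whole job. The function name is our own. Because the prices quoted above are rounded to the cent, the computed savings can differ from the quoted figure by about one percentage point.

```python
def job_cost(hourly_price_usd, runtime_minutes):
    """Approximate instance cost for one job, assuming linear billing."""
    return hourly_price_usd * runtime_minutes / 60


RUNTIME_MINUTES = 98     # JackHMMER run time from the text
ON_DEMAND_PRICE = 0.81   # USD per hour, rounded, from the text
SPOT_PRICE = 0.30        # USD per hour, rounded, from the text

on_demand = job_cost(ON_DEMAND_PRICE, RUNTIME_MINUTES)
spot = job_cost(SPOT_PRICE, RUNTIME_MINUTES)
savings = 1 - spot / on_demand

print(f"On-Demand: ${on_demand:.2f}  Spot: ${spot:.2f}  Savings: {savings:.0%}")
```

For this job, that works out to about $1.32 On-Demand versus about $0.49 on Spot. The estimate covers the instance only and leaves out storage and data transfer charges.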
Cleanup
To clean up and stop all ongoing charges, first navigate to the CloudFormation console and select Stacks from the sidebar. Then, turn off the View nested option to hide the nested stacks. Finally, select your stack and click Delete. This will remove all resources and data associated with the AWS Batch Architecture for Protein Folding and Design from your account.
Conclusion
The application of machine learning to life science problems like protein structure prediction is rapidly advancing. New algorithms like OpenFold promise to improve the speed of discovery while keeping R&D costs under control. Flexible HPC services like AWS Batch help customers take advantage of all the AI-driven tools in their toolbox.