AWS HPC Blog
Leveraging Seqera Platform on AWS Batch for machine learning workflows – Part 1 of 2
This post was contributed by Dr Ben Sherman, Paolo Di Tommaso, and Gord Sissons from Seqera, and Dr Olivia Choudhury, Aniket Deshpande, and Abhijit Roy from AWS.
Machine learning (ML) is used for multiple healthcare and life sciences (HCLS) applications, including medical imaging, protein folding, drug discovery, and gene editing. While Nextflow pipelines (like those in nf-core) are commonly used for genomics, they are also being adopted for machine learning workloads.
Nextflow is an excellent solution for many ML-based scenarios. Sometimes you need to continuously (and automatically) retrain your models based on rapidly-changing datasets from external sources such as sequencers. Sometimes you need training and inference resources sporadically but you face constraints getting GPUs or FPGAs – even for short periods. And often pipelines like nf-core/proteinfold have compute and data-intensive inference steps where many samples need to be processed in parallel.
Over this two-part series, we’ll show you how these kinds of challenges can be addressed using Nextflow and the Seqera Platform integrated with AWS.
In part one of this two-part blog series, we explain how to build an example Nextflow pipeline that performs ML model training and inference for image analysis, illustrating how Nextflow supports custom ML-based workflows. We also discuss how healthcare and life sciences customers are using this today.
In part two, we’ll provide a step-by-step guide explaining how users new to the Seqera Platform can rapidly get started with AWS, maximizing the use of AWS Batch, Amazon Simple Storage Service (Amazon S3), and other AWS services.
Seqera on AWS
Seqera Platform (previously Nextflow Tower) is a comprehensive bioinformatics data analysis platform deeply integrated with AWS. Seqera is used globally by leading biotechnology and pharmaceutical companies, including 10 of the top 20 global BioPharmas, and by roughly 10,000 bioinformaticians across hundreds of organizations.
Seqera Platform has several key features:
- It is explicitly designed for Nextflow pipelines.
- It is cross-platform – Seqera works with AWS and other cloud and HPC providers, including on-premises systems. It supports customers’ preferred container runtimes, registries, and source code managers, and it uses multiple AWS services, including AWS Batch, Amazon S3, Amazon FSx for Lustre, Amazon Elastic File System (Amazon EFS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Secrets Manager, and others.
- It has a large, active user and developer community that provides high-quality, curated nf-core pipelines and modules.
Seqera Platform can be deployed in two different ways.
Seqera Cloud is a fully managed service hosted exclusively on AWS infrastructure. Presently, there are 8,000+ corporate and research Seqera Cloud users. Researchers can use Seqera Cloud for free and progress to paid plans as their needs evolve.
Seqera Enterprise is a customer-managed version of the Seqera platform that is deployable on-premises or on a customer’s preferred cloud. Some customers install Seqera on-premises, while others deploy Seqera on AWS using Docker Compose or the Amazon Elastic Kubernetes Service (EKS).
Seqera employs a “bring your own credentials” model. As illustrated in the architecture diagram in Figure 1, users log into Seqera and add compute environments by supplying credentials for their preferred cloud or HPC workload manager.
Seqera Platform users have a private workspace and can be assigned to various shared workspaces, each with its own pipelines, datasets, and compute environments. Seqera sidesteps the complexity of running in different cloud or HPC environments by providing a consistent user experience regardless of the underlying infrastructure.
While most workloads run on-premises or on AWS infrastructure, this ability to deploy to different compute environments is useful for several reasons:
- Customers can leverage on-premises HPC clusters and tap cloud capacity when their own resources are fully utilized.
- Research frequently involves datasets hosted on third-party clouds, making it more cost-effective to bring the compute to the data rather than transferring large datasets to a local execution environment.
- Academic and research efforts frequently involve collaboration among institutions using different infrastructure. Seqera allows these users to seamlessly share pipelines, datasets, computing infrastructure, and research results without exposing private cloud credentials.
Seqera Forge
While users can choose to run pipelines on pre-existing AWS Batch environments, Seqera Forge fully automates creating and configuring AWS Batch compute environments and queues for Nextflow pipelines, following best practices. Seqera can also dispose of cloud resources when they’re not in use, helping reduce costs.
By leveraging AWS APIs, Forge dramatically simplifies the deployment, configuration, and teardown of AWS infrastructure, making it possible for researchers with minimal knowledge of “CloudOps” to deploy cloud infrastructure themselves.
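To give a sense of what Forge automates, here is a rough boto3 sketch of the kind of AWS Batch resources it provisions: a managed, Spot-based compute environment and a job queue that points at it. The subnet, security group, and instance-profile identifiers below are placeholders, and a real deployment (which Forge handles for you) would also create the supporting VPC, IAM roles, and launch template, and wait for the compute environment to become VALID before attaching the queue.

import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Managed, Spot-based compute environment; all IDs and ARNs below are placeholders
batch.create_compute_environment(
    computeEnvironmentName="nextflow-spot-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
)

# In practice, wait for the compute environment to reach the VALID state
# before creating a job queue that references it
batch.create_job_queue(
    jobQueueName="nextflow-spot-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[{"order": 1, "computeEnvironment": "nextflow-spot-ce"}],
)

Forge creates these resources (plus the surrounding networking and IAM pieces) on your behalf, and can tear them down again when the compute environment is deleted.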
A sample training dataset
To illustrate how machine learning and inference workloads can be run on AWS using Seqera Platform, we used the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. This is a well-known dataset, often used as an example for learning or comparing different ML techniques, specifically for image classification. It consists of 569 samples, each with a set of 30 features computed from an image of breast tissue. The diagnosis column in the data indicates whether the sample was benign (B) or malignant (M), as illustrated in Figure 2.
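If you want to inspect the data yourself before wiring it into a pipeline, the same dataset ships with scikit-learn as load_breast_cancer; a quick look at its shape and class balance:

from sklearn.datasets import load_breast_cancer

# WDBC: 569 samples, 30 numeric features, and a binary benign/malignant diagnosis
data = load_breast_cancer(as_frame=True)
print(data.frame.shape)                      # (569, 31): 30 features plus the target column
print(list(data.target_names))               # ['malignant', 'benign']
print(data.frame["target"].value_counts())   # class balance of the diagnosis label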
In the sample pipeline, we will train and evaluate multiple models to classify these breast samples as benign or malignant. In a real-world scenario, we could also use k-fold cross-validation to evaluate each model on several randomized partitions of the dataset, and apply multiple performance metrics with minimum requirements to determine whether a model is “good enough” to be used in production.
For our purposes here, we will simply evaluate each model on a single 80/20 train/test split and select the model with the highest test accuracy.
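As a rough illustration of the cross-validation approach mentioned above, scikit-learn’s cross_val_score can evaluate a candidate model across k randomized folds; the 0.95 floor used here is an arbitrary stand-in for a real acceptance criterion.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Score one candidate model on 5 randomized folds instead of a single split
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())

# Hypothetical acceptance gate: promote the model only if every fold clears the bar
if scores.min() >= 0.95:
    print("model meets the minimum accuracy requirement")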
A sample pipeline
For illustration purposes, we use a simple proof-of-concept pipeline called Hyperopt developed by Seqera Labs. The pipeline takes any tabular dataset as input (or the name of a dataset on OpenML). It then trains and evaluates a set of ML models on the dataset, reporting the model that achieved the highest test accuracy. You can learn more about this pipeline in the article Nextflow and Tower for Machine Learning. The pipeline code is available on GitHub. Figure 3 shows a Mermaid diagram, automatically generated by Nextflow, of the overall pipeline.
By default, the pipeline uses the WDBC dataset described above and evaluates five different classification models:
- A baseline model (i.e. “dummy” model) that simply predicts the most common label
- Gradient Boosting (gb)
- Logistic Regression (lr)
- Multi-layer Perceptron (mlp) (i.e. neural network)
- Random Forest (rf)
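The training and evaluation logic of such a pipeline can be sketched in a few lines of scikit-learn. The snippet below is a simplified, single-process stand-in using the same five model families and an 80/20 split, not the pipeline’s actual code; in hyperopt, each model is trained and evaluated in its own Nextflow task, so the fits run in parallel and only a summary step compares their accuracies.

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)  # WDBC: 569 samples x 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "gb": GradientBoostingClassifier(random_state=42),
    "lr": LogisticRegression(max_iter=5000),
    "mlp": MLPClassifier(max_iter=2000, random_state=42),
    "rf": RandomForestClassifier(random_state=42),
}

# Train each model, score it on the held-out 20%, and keep the most accurate one
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
best = max(scores, key=scores.get)
print(f"The best model for 'wdbc' was '{best}', with accuracy = {scores[best]:.3f}")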
When you run the pipeline, you should see something like this:
$ nextflow run hyperopt -profile wave
[...]
The best model for 'wdbc' was 'mlp', with accuracy = 0.991
This shows that Nextflow ran a pipeline that trained different ML models on the WDBC dataset and evaluated their performance during model inference. The multi-layer perceptron was the most accurate at classifying breast tumor samples as benign or malignant. For further details of the pipeline and its deployment, refer to part two of this blog series.
While the hyperopt pipeline implements a simple classification task, it provides all the building blocks you need to create your own ML pipelines with Nextflow.
Seqera is also an excellent solution for deploying GPU-based workloads in the AWS cloud. For a hands-on tutorial, see the article Running AI workloads in the cloud with Nextflow Tower — a step-by-step guide.
It’s the customers that matter most
Seqera is used by hundreds of pharmaceutical, healthcare, and biotech companies to run data analysis pipelines in the AWS Cloud. According to the latest 2023 State of the Workflow Survey, AWS is the most popular cloud environment among Nextflow users, with 49% of all Nextflow users surveyed already using or planning to use AWS within the next two years and 35.1% of survey respondents using AWS Batch [1,2]. The survey results showed strong cloud adoption, with the percentage of Nextflow users running in the cloud up 20% over 2022 [2].
Among the customers running Seqera and Nextflow on AWS are:
- Arcus Biosciences—Arcus Biosciences is at the forefront of designing combination therapies with best-in-class potential in the pursuit of cures for cancer. By using Seqera Platform on AWS, Arcus was able to improve productivity, ensure pipeline traceability, and use cloud resources more efficiently. They were also able to prepare for future growth by scaling capacity for research and clinical trials while providing an intuitive, collaborative user experience to researchers and clinicians. Read the case study here.
- Gritstone Bio—Gritstone Bio is developing targeted immunotherapies for cancer and infectious disease. Gritstone’s approach seeks to generate a therapeutic immune response by leveraging insights into the immune system’s ability to recognize and destroy diseased cells by targeting select antigens. Their workloads involve massive compute requirements for the analysis of individual biopsies and make extensive use of machine learning for tumor classification models. Gritstone uses Seqera and multiple AWS services to manage its bioinformatics pipelines. Read the case study here.
- Tessera Therapeutics—Tessera Therapeutics is pioneering a new category of genetic medicine and relies heavily on genomic analysis pipelines to identify promising new treatments. By using Seqera Platform to manage analysis pipelines on AWS, Tessera increased its analysis throughput and research productivity while simultaneously containing cloud spending by using resources more efficiently. You can read the case study here.
Conclusion
For organizations collaborating on large-scale data analysis and ML workloads, Seqera on AWS is an excellent solution. You can easily deploy powerful AWS compute and storage resources at scale, reduce costs through optimized resource usage, and manage spending across projects and teams.
In part two of this blog series, we will provide a step-by-step guide, explaining how you can easily deploy a Seqera environment on AWS to run ML pipelines like the one above, and other Nextflow pipelines.
References
[1] The State of the Workflow 2023: Community Survey Results.
[2] The State of the Workflow 2022: Community Survey Results.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.