AWS for Industries
Orchestrating Multiple AWS HealthOmics Workflows at Scale
In this post, you’ll learn how to use serverless computing to streamline genomics analysis workflows and reach insights faster. By combining AWS Step Functions, AWS Lambda, and AWS HealthOmics, you can parallelize and modularize resource-intensive genomics pipelines. This approach improves efficiency, scalability, and cost optimization, helping businesses and researchers shorten time-to-insight. The technical implementation details, coupled with real-world use cases, will guide you through architecting and deploying scalable, event-driven genomics analysis solutions on the cloud.
Clinical diagnostics and biopharma industries are witnessing a seismic shift, driven by the ever-increasing demand for efficiency and precision in genomics analysis. As next-generation sequencing technologies continue to evolve, state-of-the-art genomic sequencers generate vast troves of data in formats like FASTQ and BAM, necessitating scalable and cost-effective solutions. Enter AWS HealthOmics, a fully managed service that offers powerful pre-built genomics analyses (Ready2Run workflows, R2R) and the ability to run your own custom (“private”) workflows, transforming and simplifying genomic data analysis.
By leveraging AWS HealthOmics, bioinformaticians, researchers, and scientists can streamline complex genomics pipelines, accelerate time-to-insight, and drive innovations in precision medicine.
The coordination of multiple high-level analysis phases presents a formidable challenge in bioinformatics and genomics. Each analysis phase can be a complex pipeline in itself, with its own dedicated development team and lifecycle. While much of the industry today focuses on running individual workflows, there is an anti-pattern of creating monolithic “do-everything” workflows, which can be cumbersome and difficult to maintain. By leveraging the power of AWS HealthOmics and AWS Step Functions, customers can modularize high-level analyses into component workflows and execute them in parallel at cloud scale. This approach allows for the decoupling of complex genomics pipelines into reusable and independently manageable components, facilitating collaboration, scalability, and efficient resource utilization.
The orchestration is driven by AWS Step Functions state machines, which coordinate the parallel execution of multiple workflows. The primary state machine iterates through the samples listed in the input metadata file and invokes a secondary state machine for each sample. The secondary state machine encapsulates the logic for triggering and monitoring AWS HealthOmics workflows: AWS Lambda functions start the workflow run and then periodically check its status through the AWS HealthOmics API.
This decoupled architecture allows for efficient parallelization of multiple workflows, maximizing resource utilization. The solution’s design ensures seamless integration with both R2R workflows and custom private workflows hosted in AWS HealthOmics. By abstracting the workflow execution details within the secondary state machine, the architecture promotes reusability and extensibility, enabling technology leaders to easily incorporate new workflows or modify existing ones without disrupting the overall orchestration logic.
Solution architecture
The proposed solution employs a modular and scalable serverless architecture to orchestrate AWS HealthOmics workflows efficiently.
Figure 1: An architecture diagram showing how to orchestrate multiple AWS HealthOmics Workflows at scale using AWS Lambda and AWS Step Functions
AWS Lambda is a serverless compute service that enables running code without provisioning servers, making it ideal for event-driven, parallel execution of tasks.
AWS Step Functions is a serverless orchestrator that sequences Lambda functions and integrates with other AWS services using visual workflows. Step Functions supports long-running workflows, making it suitable for orchestrating multi-step genomics pipelines and parallelizing analysis workloads using the Map state.
AWS HealthOmics is a managed service tailored for genomics and life sciences workloads. It offers pre-built, production-ready R2R (Ready2Run) workflows and allows creating custom private workflows using WDL, Nextflow, and CWL.
The diagram illustrates the flow of data and execution between the various components, highlighting the modular and parallelized nature of the solution. The orchestration of multiple workflows is achieved through the ‘SFNA’ state machine, while the ‘SFNB’ state machine provides a reusable component for executing individual AWS HealthOmics workflows with automated workflow execution and notifications of status and errors.
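As a rough illustration of how ‘SFNA’ could be expressed, the sketch below builds an Amazon States Language definition as a Python dictionary. The three state names come from the component list in this post; everything else is an assumption made for illustration, including the input field names (Samples, FASTQ), the shape of the input passed to ‘SFNB’ (WorkflowId, InputUri), the inner state name RunSampleWorkflow, and the default concurrency limit. The nested state machine is started with the Step Functions ‘startExecution.sync:2’ service integration, which waits for the child execution to finish.

```python
import json


def build_sfna_definition(sfnb_arn, notify_lambda_arn, max_concurrency=10):
    """Return a sketch of the primary (SFNA) state machine definition."""

    def run_sfnb(workflow_id_path, input_uri_path):
        # Start a nested SFNB execution and wait for it to finish (.sync:2).
        return {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
                "StateMachineArn": sfnb_arn,
                # Hypothetical input contract between SFNA and SFNB.
                "Input": {
                    "WorkflowId.$": workflow_id_path,
                    "InputUri.$": input_uri_path,
                },
            },
        }

    # Inside the Map state, "$" refers to a single sample entry.
    sample_task = run_sfnb("$.AHOSampleWorkflowID", "$.FASTQ")
    sample_task["End"] = True

    # After the Map state, "$" refers to the whole metadata document again.
    qc_task = run_sfnb("$.AHOQCWorkflowID", "$.QCFASTQ")
    qc_task["ResultPath"] = "$.QCResult"
    qc_task["Next"] = "NotifyAdministrator"

    return {
        "Comment": "Illustrative sketch of SFNA, not the published definition",
        "StartAt": "ProcessFlowcellSamples",
        "States": {
            "ProcessFlowcellSamples": {
                "Type": "Map",
                "ItemsPath": "$.Samples",
                "MaxConcurrency": max_concurrency,
                # Keep the original input so the QC step can still read it.
                "ResultPath": "$.SampleResults",
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "RunSampleWorkflow",
                    "States": {"RunSampleWorkflow": sample_task},
                },
                "Next": "ProcessQCWorkflow",
            },
            "ProcessQCWorkflow": qc_task,
            "NotifyAdministrator": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": notify_lambda_arn,
                    "Payload.$": "$",
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace with real ones before deploying.
    definition = build_sfna_definition(
        "arn:aws:states:us-east-1:123456789012:stateMachine:SFNB",
        "arn:aws:lambda:us-east-1:123456789012:function:NotifyAdministrator",
    )
    print(json.dumps(definition, indent=2))
```

Setting ResultPath on the Map state is a deliberate choice in this sketch: without it, the Map output would replace the state input and the QC step would no longer see the QC fields from the metadata file.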
The orchestration process is initiated by uploading a JSON metadata file to an Amazon S3 bucket. This metadata file serves as the central source of information, containing details about the genomic data samples and the associated workflows. Specifically, the JSON metadata file includes:
1. A list of samples, each with its unique sample ID and a reference to the corresponding FASTQ file located in another Amazon S3 bucket dedicated to storing sample data.
2. The AWS HealthOmics workflow ID (AHOSampleWorkflowID) associated with each sample, indicating the specific genomics analysis workflow to be executed for that sample.
3. A reference to the input file for the Quality Control (QC) workflow (QCFASTQ), stored in a separate Amazon S3 bucket, along with the corresponding AWS HealthOmics workflow ID (AHOQCWorkflowID) for the QC workflow.
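To make the three items above concrete, here is a minimal sketch of what such a metadata file might contain, together with a small validation helper. Only the names AHOSampleWorkflowID, QCFASTQ, and AHOQCWorkflowID come from this post; the other keys (Samples, SampleID, FASTQ), the bucket names, and the workflow IDs are placeholders chosen for illustration, and the actual file layout in the sample repository may differ.

```python
import json

# Hypothetical metadata document; bucket names and workflow IDs are placeholders.
EXAMPLE_METADATA = {
    "Samples": [
        {
            "SampleID": "sample-001",
            "FASTQ": "s3://example-sample-bucket/run-01/sample-001.fastq.gz",
            "AHOSampleWorkflowID": "1234567",
        },
        {
            "SampleID": "sample-002",
            "FASTQ": "s3://example-sample-bucket/run-01/sample-002.fastq.gz",
            "AHOSampleWorkflowID": "1234567",
        },
    ],
    "QCFASTQ": "s3://example-qc-bucket/run-01/qc-input.fastq.gz",
    "AHOQCWorkflowID": "7654321",
}


def validate_metadata(metadata):
    """Return a list of problems found; an empty list means the file looks usable."""
    problems = []
    samples = metadata.get("Samples")
    if not isinstance(samples, list) or not samples:
        problems.append("Samples must be a non-empty list")
        samples = []

    seen_ids = set()
    for index, sample in enumerate(samples):
        # Every sample needs an ID, an input file, and a workflow to run.
        for key in ("SampleID", "FASTQ", "AHOSampleWorkflowID"):
            if not sample.get(key):
                problems.append(f"Samples[{index}] is missing {key}")
        fastq = sample.get("FASTQ")
        if fastq and not fastq.startswith("s3://"):
            problems.append(f"Samples[{index}].FASTQ must be an s3:// URI")
        sample_id = sample.get("SampleID")
        if sample_id is not None and sample_id in seen_ids:
            problems.append(f"Duplicate SampleID: {sample_id}")
        seen_ids.add(sample_id)

    # The QC workflow is described once, at the top level.
    for key in ("QCFASTQ", "AHOQCWorkflowID"):
        if not metadata.get(key):
            problems.append(f"Metadata is missing {key}")
    return problems


if __name__ == "__main__":
    print(json.dumps(EXAMPLE_METADATA, indent=2))
    print(validate_metadata(EXAMPLE_METADATA))
```

Running a check like this before starting any state machine is one way to reject a malformed upload early, before any AWS HealthOmics runs are launched.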
The key components of this architecture are as follows:
1. ‘Main’ S3 Bucket: This S3 bucket hosts the JSON metadata file containing information about the samples from a sequencing run and the QC workflow. By utilizing a JSON metadata file stored in an Amazon S3 bucket, the architecture enables seamless chaining of AWS HealthOmics R2R workflows with custom private workflows, facilitating end-to-end analysis pipelines.
2. ‘TriggerLambda’: This Lambda function is triggered when the JSON metadata file is uploaded to the Main S3 bucket. It parses the metadata file and starts the ‘SFNA’ Step Functions state machine.
3. ‘SFNA’ Step Functions State Machine: This state machine orchestrates the parallel execution of multiple sample workflows and the subsequent QC workflow. ‘SFNA’ iterates through the samples listed in the metadata file using a Map state. For each sample, ‘SFNA’ invokes the modular ‘SFNB’ state machine, passing the sample ID, the reference to the FASTQ file in an Amazon S3 bucket, and the associated AWS HealthOmics workflow ID. It has the following states:
a. The ‘ProcessFlowcellSamples’ Map state iterates through the samples and invokes the ‘SFNB’ state machine for each sample.
b. The ‘ProcessQCWorkflow’ state invokes the ‘SFNB’ state machine for the QC workflow.
c. The ‘NotifyAdministrator’ state is responsible for sending a notification upon completion of all workflows.
4. ‘SFNB’ Step Functions State Machine: This modular state machine encapsulates the logic for executing AWS HealthOmics workflows. It has the following states:
a. The ‘TriggerAHO’ Lambda function starts the AWS HealthOmics workflow run using the AWS HealthOmics API’s ‘start_run’ command, passing the relevant workflow ID and the reference to the input data file stored in an Amazon S3 bucket.
b. The ‘Wait’ state introduces a configurable delay before invoking another Lambda function, ‘CheckAHO’.
c. The ‘CheckAHO’ Lambda function checks the status of the workflow run by querying the AWS HealthOmics API with the ‘get_run’ command, passing the run ID obtained from ‘TriggerAHO’. While the run is still in progress, ‘SFNB’ returns to the ‘Wait’ state and checks again. Once ‘get_run’ returns a COMPLETED status, ‘SFNB’ concludes the workflow orchestration process.
d. The state machine handles completion, failure, and notification logic.
5. ‘NotifyAdministrator’ Lambda: This Lambda function sends a notification (e.g., email or SNS message) to the administrator upon successful completion of all workflows.
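The polling loop inside ‘SFNB’ can be sketched as an Amazon States Language definition, again built as a Python dictionary. The state names TriggerAHO, Wait, and CheckAHO follow the list above. The Choice, Succeed, and Fail states (IsRunFinished, RunSucceeded, RunFailed), the ‘$.Run’ result path, the field name Status, and the 300-second default delay are assumptions added for illustration, and notification handling is left out for brevity.

```python
import json

# Run statuses treated as unsuccessful endings in this sketch.
FAILURE_STATUSES = ("FAILED", "CANCELLED", "DELETED")


def build_sfnb_definition(trigger_aho_arn, check_aho_arn, wait_seconds=300):
    """Return a sketch of the reusable SFNB state machine definition."""
    failure_rules = [
        {"Variable": "$.Run.Status", "StringEquals": status}
        for status in FAILURE_STATUSES
    ]
    return {
        "Comment": "Illustrative sketch of SFNB, not the published definition",
        "StartAt": "TriggerAHO",
        "States": {
            # Start the HealthOmics run and store its ID and status under $.Run.
            "TriggerAHO": {
                "Type": "Task",
                "Resource": trigger_aho_arn,
                "ResultPath": "$.Run",
                "Next": "Wait",
            },
            # Configurable delay between status checks.
            "Wait": {"Type": "Wait", "Seconds": wait_seconds, "Next": "CheckAHO"},
            # Refresh $.Run with the latest status.
            "CheckAHO": {
                "Type": "Task",
                "Resource": check_aho_arn,
                "InputPath": "$.Run",
                "ResultPath": "$.Run",
                "Next": "IsRunFinished",
            },
            # Finish on a terminal status, otherwise loop back to Wait.
            "IsRunFinished": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.Run.Status",
                        "StringEquals": "COMPLETED",
                        "Next": "RunSucceeded",
                    },
                    {"Or": failure_rules, "Next": "RunFailed"},
                ],
                "Default": "Wait",
            },
            "RunSucceeded": {"Type": "Succeed"},
            "RunFailed": {
                "Type": "Fail",
                "Error": "HealthOmicsRunFailed",
                "Cause": "The AWS HealthOmics run did not complete successfully",
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace with real ones before deploying.
    print(json.dumps(build_sfnb_definition(
        "arn:aws:lambda:us-east-1:123456789012:function:TriggerAHO",
        "arn:aws:lambda:us-east-1:123456789012:function:CheckAHO",
    ), indent=2))
```

Ending in a Fail state when the run does not complete means a parent state machine that waits on this execution can see the failure and react to it, rather than treating every finished execution as a success.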
The sample code for this reference architecture is intended for demonstration purposes only, not for production use.
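In the same demonstration-only spirit, the two Lambda functions used by ‘SFNB’ could look roughly like the sketch below. It uses the boto3 ‘omics’ client calls ‘start_run’ and ‘get_run’. The event fields (WorkflowId, InputUri, WorkflowType, RunId), the environment variable names, and the workflow parameter name input_fastq are all hypothetical; in practice, the parameter name depends on the parameter template of the workflow being run, and error handling and retries are omitted here.

```python
import os

# Statuses after which a run will not change any further.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "DELETED"}


def _omics_client():
    # Imported lazily so the module can be loaded without boto3 installed.
    import boto3
    return boto3.client("omics")


def trigger_aho(event, context=None, omics=None):
    """Start an AWS HealthOmics run for one input file (TriggerAHO sketch)."""
    omics = omics or _omics_client()
    # Hypothetical setting: the name of the workflow parameter that takes the input file.
    parameter_name = os.environ.get("OMICS_INPUT_PARAMETER", "input_fastq")
    response = omics.start_run(
        workflowId=event["WorkflowId"],
        # "PRIVATE" for custom workflows, "READY2RUN" for R2R workflows.
        workflowType=event.get("WorkflowType", "PRIVATE"),
        roleArn=os.environ["OMICS_ROLE_ARN"],      # assumed environment variable
        outputUri=os.environ["OMICS_OUTPUT_URI"],  # assumed environment variable
        parameters={parameter_name: event["InputUri"]},
    )
    return {"RunId": response["id"], "Status": response["status"]}


def check_aho(event, context=None, omics=None):
    """Look up the current status of a run (CheckAHO sketch)."""
    omics = omics or _omics_client()
    run = omics.get_run(id=event["RunId"])
    status = run["status"]
    return {
        "RunId": run["id"],
        "Status": status,
        "Finished": status in TERMINAL_STATUSES,
    }
```

Both handlers accept an optional client argument so they can be exercised locally with a stub object in place of the real service client.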
Additional considerations
The solution above uses a couple of integration patterns: the “Job Poller” pattern for integrating AWS Step Functions with AWS HealthOmics, and nested state machines. Depending on the duration of your component AWS HealthOmics workflows, these patterns may need further optimization for cost and service quotas. For scenarios where chaining is simple and linear from one workflow to the next, you could consider an event-driven architecture as an alternative. Over time, you can expect additional and simplified integration between AWS Step Functions and AWS HealthOmics.
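To see why run duration matters for the Job Poller pattern, it helps to count how many times the polling loop executes. Step Functions Standard Workflows are billed per state transition, so a long run combined with a short wait interval adds up quickly. The helper below is a back-of-the-envelope sketch only: it assumes three states per loop iteration (a wait, a status check, and a choice), and the run duration and sample count in the example are hypothetical. It deliberately leaves out prices, which vary by Region and change over time.

```python
import math


def estimate_polling(run_minutes, wait_seconds, states_per_loop=3, samples=1):
    """Estimate status checks and loop state transitions for polled runs."""
    # One status check per wait interval until the run finishes (at least one).
    polls_per_run = max(1, math.ceil(run_minutes * 60 / wait_seconds))
    return {
        "polls_per_run": polls_per_run,
        "total_polls": polls_per_run * samples,
        "total_loop_transitions": polls_per_run * states_per_loop * samples,
    }


if __name__ == "__main__":
    # Hypothetical flowcell: 100 samples, each run taking about 8 hours.
    for wait in (60, 300):
        print(wait, estimate_polling(run_minutes=480, wait_seconds=wait, samples=100))
```

In this hypothetical case, raising the wait interval from 60 to 300 seconds cuts the number of status checks and loop transitions by a factor of five, at the price of detecting completion up to five minutes later.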
Conclusion
This post presented a solution that combines AWS serverless technologies, including AWS Lambda, AWS Step Functions, and AWS HealthOmics, into a modular, scalable, and cost-effective approach for running multiple genomic workflows in parallel. This approach maximizes resource utilization, shortening overall analysis time and enabling faster time-to-insight for technology leaders and their teams.
To try this out on your own, see the source code and additional technical details for this solution in the open-source repository on GitHub.
By combining AWS HealthOmics R2R workflows with custom private workflows, the solution lets organizations offload commoditized aspects of their analysis. This frees up valuable time and resources to concentrate on the differentiated and proprietary aspects of their genomics pipelines, drive innovation, and unlock new frontiers in precision medicine and genomics research.