AWS HPC Blog
Leveraging Seqera Platform on AWS Batch for machine learning workflows – Part 2 of 2
This post was contributed by Dr Ben Sherman, Paolo Di Tommaso, and Gord Sissons from Seqera, and Dr Olivia Choudhury, Aniket Deshpande, and Abhijit Roy from AWS.
In part one of this two-part series, we explained how Nextflow and Seqera Platform work with AWS, and we provided a sample Nextflow pipeline that performs ML model training and inference for image analysis to illustrate how Nextflow can support custom ML-based workflows. We also discussed how AWS customers are using this today.
In this second post, we will provide a step-by-step guide explaining how users new to Seqera Platform can rapidly get started on AWS, maximizing the use of AWS Batch, Amazon Simple Storage Service (Amazon S3), and other AWS services.
Depending on your familiarity with AWS, these steps should take around an hour to complete.
Prerequisites
Before following the step-by-step guide here, readers should:
- Have an AWS account and be familiar with using the AWS Management Console or AWS Command Line Interface (CLI).
- Have a high-level understanding of AWS Identity and Access Management (IAM) users and roles.
- Obtain a free Seqera Cloud account for testing.
Step-by-step guide
This section explains how to set up an AWS environment to support pipeline execution, obtain a free Seqera Cloud account, and configure Seqera to launch and manage the ML pipeline introduced in part 1 of this series.
Create an Amazon S3 bucket
The first step is to create an S3 bucket in your preferred AWS region (we used us-east).
- Once logged into the AWS console, navigate to the Amazon S3 service
- Click on Create bucket
- Give the bucket a globally unique name (for example, we used seqera-work-bucket, but you will need to choose something else) and select the AWS region where the bucket will reside
- You can leave ACLs disabled and block all public access to the bucket, since Seqera Platform will use this bucket internally and others do not need to see its contents
- Accept the defaults for the remaining settings and select Create bucket
Once the bucket is created, select it and record its Amazon Resource Name (ARN) by clicking Copy ARN. In our example, the ARN is arn:aws:s3:::seqera-work-bucket.
You will need this ARN to set up an IAM policy in the following steps.
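If you prefer to script this step, the bucket can also be created with the AWS SDK. The following is a minimal sketch using Python and boto3; it assumes the us-east-1 Region and the example bucket name seqera-work-bucket, so substitute your own Region and a globally unique name.

```python
import boto3

REGION = "us-east-1"            # assumed Region; change to the one you selected
BUCKET = "seqera-work-bucket"   # example name; bucket names must be globally unique

s3 = boto3.client("s3", region_name=REGION)

# us-east-1 does not accept a LocationConstraint; other Regions require one.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Block all public access, matching the console settings described above.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# S3 bucket ARNs follow a fixed pattern, so the ARN can be derived directly.
print(f"Bucket ARN: arn:aws:s3:::{BUCKET}")
```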
Set up your IAM credentials
To create an AWS Batch environment and marshal other AWS cloud resources, Seqera will need AWS credentials. We’ll use the AWS Identity and Access Management (IAM) service to create appropriate IAM policies and attach these policies to an AWS IAM user or role.
First, create a policy with sufficient privileges to manage the AWS Batch environment and support pipeline execution. Follow the steps below to create a custom policy named Seqera_Forge_Policy.
- Navigate to the IAM console by searching for the IAM service
- Under Access Management, select Policies
- Click on Create policy to create a new IAM policy
- Under Specify permissions, select the JSON policy editor
- Copy the contents of the default forge-policy.json file located on GitHub into the policy editor
- Click Next
- Give the policy a name (like Seqera_Forge_Policy)
- Select Create policy
Note that this is a sample policy; review it against your own security requirements, particularly if you intend to use it in production.
With these steps, you have created a policy to give Seqera sufficient permissions to deploy an AWS Batch environment on your behalf.
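If you'd rather create this policy programmatically, here is a minimal boto3 sketch. It assumes you have downloaded the forge-policy.json file referenced above into your working directory, and it uses the same policy name as the console steps.

```python
import boto3

iam = boto3.client("iam")

# Read the forge-policy.json file downloaded from GitHub (see link above).
with open("forge-policy.json") as f:
    forge_policy_json = f.read()

response = iam.create_policy(
    PolicyName="Seqera_Forge_Policy",
    PolicyDocument=forge_policy_json,
)
print("Created policy:", response["Policy"]["Arn"])
```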
Next, you’ll need to create a similar policy that gives your IAM user or role permission to access the Amazon S3 bucket you created earlier.
- Repeat the steps in IAM as before to create a new policy for accessing the S3 bucket by clicking Create policy from the Policies screen
- Under Specify permissions, select the JSON policy editor
- Copy the contents of the default s3-bucket-write.json file located on GitHub into the policy editor
- Update the ARN in the policy to reflect the ARN of the S3 bucket you created
- Click Next
- Give the policy a name (like Seqera_S3_Bucket_Write)
- Select Create policy
You should now have two new customer-managed policies, as shown in Figure 1.
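If you scripted the first policy with boto3, the second one can be created the same way. The sketch below substitutes your bucket ARN into the downloaded s3-bucket-write.json before creating the policy; the exact layout of the template may differ from what this assumes, so adjust the substitution to match the file you downloaded.

```python
import json
import boto3

iam = boto3.client("iam")
BUCKET_ARN = "arn:aws:s3:::seqera-work-bucket"  # replace with your bucket's ARN

with open("s3-bucket-write.json") as f:
    policy = json.load(f)

# Point each statement at your bucket and the objects inside it.
# (Assumption: every statement in the template scopes to the bucket resources.)
for statement in policy["Statement"]:
    statement["Resource"] = [BUCKET_ARN, f"{BUCKET_ARN}/*"]

iam.create_policy(
    PolicyName="Seqera_S3_Bucket_Write",
    PolicyDocument=json.dumps(policy),
)
```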
Next, attach these policies to an IAM user or IAM role. You may already have your own IAM users or roles defined, in which case you can bind the policies to existing users or roles.
If you are doing this for the first time, follow these steps to create a new IAM user called seqera-user and attach the policies created above to that user:
- From the AWS console, navigate to the IAM console
- Under Users, select Create user
- Give the user a name like seqera-user
- There is no need to provide this IAM user with access to the AWS Management Console
- Select Attach policies directly
- Find the two customer-managed policies you created earlier by searching for the string Seqera, then select both policies to attach them to the new IAM user
- After attaching the policies, select Create user
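The equivalent boto3 calls look roughly like this; the account ID is looked up via AWS STS so the ARNs of the customer-managed policies can be constructed.

```python
import boto3

iam = boto3.client("iam")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Create the IAM user (no console access is granted by default).
iam.create_user(UserName="seqera-user")

# Attach the two customer-managed policies created earlier.
for policy_name in ("Seqera_Forge_Policy", "Seqera_S3_Bucket_Write"):
    iam.attach_user_policy(
        UserName="seqera-user",
        PolicyArn=f"arn:aws:iam::{account_id}:policy/{policy_name}",
    )
```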
The final step is to create an access key for the new IAM user, which Seqera will use:
- Under Access management / Users, select seqera-user
- Record the user’s ARN for future reference
- Click on Create Access key
- Record the Access key and Secret access key and store them in a safe place. You’ll need these credentials to create a compute environment in the Seqera Platform.
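Scripted, the access key step is a single call. Note that the secret access key is returned only once, so capture it immediately and keep it somewhere safe (for example, in a secrets manager).

```python
import boto3

iam = boto3.client("iam")

key = iam.create_access_key(UserName="seqera-user")["AccessKey"]
print("Access key ID:    ", key["AccessKeyId"])
# The secret is only available in this response; store it securely.
print("Secret access key:", key["SecretAccessKey"])
```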
Create a free Seqera Cloud account
The next step is to create a free Seqera Cloud account. Navigate to https://seqera.io and click on Login / Sign up. You can create a free account and sign into the Seqera cloud environment hosted on AWS infrastructure using GitHub or Google credentials – or by providing an email address.
After logging in, you’ll be directed to the Seqera Platform console.
Set up an AWS Batch Compute Environment in Seqera
On initial login to the Seqera Platform, you’ll be taken to a community showcase workspace with ready-to-run pipelines and preconfigured AWS compute environments. New users can experiment with community showcase pipelines and Datasets and enjoy up to 100 hours of free AWS cloud credits.
You can also watch a short introductory video that explains how to get started with the Seqera interface.
To add your own compute environments and pipelines, you’ll need to navigate to your personal workspace, as shown in Figure 2. Click on the workspace selector next to community | showcase and choose your personal workspace by selecting your name where it appears under user. You can optionally create a new workspace and store your compute environments and pipelines there.
The first time you log in, your personal workspace will have no compute environments, datasets, or pipelines. You can create a new AWS Batch Compute Environment (CE) by following these steps:
- Select the Compute Environments tab from within your personal workspace
- Click on Add Compute Environment
Define a new AWS Batch compute environment by completing the form, as shown in Figure 3:
The first time you add user credentials, you’ll be asked to supply the IAM Access key and Secret access key you generated earlier.
Seqera Platform will store these credentials securely and ask you to assign a name to the credentials for future reference. Your AWS credentials will not be visible to you or to other users.
Continue filling in the form, as illustrated in Figure 4:
Nextflow pipelines running on AWS Batch typically use Amazon S3 storage. Nextflow supports S3 natively, automatically copying data in and out of S3 and making it accessible to containerized bioinformatics tools.
To provide more efficient data handling, we use Seqera’s Fusion file system (Fusion v2) when setting up the pipeline. Fusion is a virtual, lightweight, distributed filesystem that allows any existing application to access object storage using the standard POSIX interface. Using the Fusion file system is optional but it’s recommended because it reduces the pipeline execution time and your cost by avoiding overhead related to data movement. If you use the Fusion file system, you must also enable Wave containers in the compute environment since the Fusion client is deployed in a Wave container.
You can learn more about Fusion file system in the whitepaper Breakthrough performance and cost-efficiency with the new Fusion file system.
Under Config mode, select Batch Forge to have Seqera automatically configure the AWS Batch account on your behalf.
For the provisioning model, select Spot, as shown in Figure 5. Since Nextflow pipeline tasks can tolerate Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances being interrupted, we recommend you use Spot Instances in most cases.
Accept the defaults for the remainder of the configuration options, as shown in Figure 6.
Once you’ve provided details of your Batch Compute Environment, click Add to add the CE.
Assuming you have valid AWS credentials and have filled in the form correctly, you should see a new Compute Environment appear in your Seqera workspace.
After creating the Compute Environment in Seqera, you can log in to the AWS Console and navigate to AWS Batch. Assuming you selected Spot provisioning, two new AWS Batch CEs – and queues – will have been created on your behalf.
Seqera Forge configures a Head queue that dispatches Nextflow head jobs to a compute environment configured to use EC2 On-Demand instances. This prevents the Nextflow head job from being interrupted during execution.
A separate Work queue and Spot-based CE are also automatically configured to support Nextflow compute tasks during pipeline execution.
The names of the queues and compute environments are assigned by Seqera Forge, as shown in Figure 7.
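You can also confirm from the AWS side that Batch Forge created the head (On-Demand) and work (Spot) resources. The sketch below simply lists the compute environments and job queues in the Region with boto3; the exact names are assigned by Forge, so don't expect them to match anything you typed.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # use your Region

# List the compute environments Batch Forge created; "type" will be
# EC2 for the On-Demand (head) environment and SPOT for the work environment.
for ce in batch.describe_compute_environments()["computeEnvironments"]:
    resources = ce.get("computeResources", {})
    print("Compute environment:", ce["computeEnvironmentName"], resources.get("type"))

# List the matching job queues (head and work).
for queue in batch.describe_job_queues()["jobQueues"]:
    print("Job queue:", queue["jobQueueName"], queue["state"])
```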
Add the ML pipeline to the Launchpad
Now that we have an AWS Batch CE set up in our Seqera workspace, we can add the ML pipeline to the Seqera Launchpad.
- From your workspace in the Seqera UI, select the Launchpad tab
- Click Add pipeline
- Give the pipeline a name like ML-hyperopt-pipeline
- Select the Compute environment created above
- Supply the pipeline’s repository URI – for our machine learning example, use https://github.com/nextflow-io/hyperopt
- Click on the Revision number field and select a tag or branch name retrieved from the source code manager to choose a particular pipeline branch or version. Use master in this example.
- Specify the S3 bucket you created earlier as the work directory (e.g. s3://seqera-work-bucket)
- Under Config profiles, select wave
- Under Advanced Options, in the Nextflow config file field, optionally set dag.enabled=true if you would like Nextflow to generate a DAG (directed acyclic graph) file
- Accept the defaults for the rest of the pipeline configuration options
- Click Add to add the pipeline to the Launchpad
- You should see the new ML-hyperopt-pipeline appear in the Seqera Launchpad in your workspace
Run the pipeline
To run the pipeline, click on Launch beside ML-hyperopt-pipeline on the Launchpad.
After launching the pipeline, you can monitor execution under the Runs tab. The pipeline will take a few minutes to start the Amazon EC2 instances supporting the AWS Batch CEs and begin dispatching tasks to Batch.
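If you want to watch the same activity from the AWS side, you can poll the work queue with boto3. The sketch below uses a hypothetical queue name; substitute the work queue name that Batch Forge assigned in your account.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # use your Region
WORK_QUEUE = "your-forge-work-queue"  # hypothetical; use the name Forge assigned

# Jobs move through SUBMITTED/RUNNABLE/STARTING before RUNNING, so check each state.
for status in ("SUBMITTED", "RUNNABLE", "STARTING", "RUNNING"):
    summaries = batch.list_jobs(jobQueue=WORK_QUEUE, jobStatus=status)["jobSummaryList"]
    for job in summaries:
        print(f"{status:9s} {job['jobName']} ({job['jobId']})")
```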
Assuming the pipeline runs successfully, you should see the following output in the Execution log accessible through the Seqera interface.
N E X T F L O W ~ version 23.10.0
Pulling nextflow-io/hyperopt ...
downloaded from https://github.com/nextflow-io/hyperopt.git
Launching `https://github.com/nextflow-io/hyperopt` [mighty_koch] DSL2 - revision: eb2b84a91a [master]
M L - H Y P E R O P T P I P E L I N E
=======================================
fetch_dataset : true
dataset_name : wdbc
visualize : true
train : true
train_data : data/*.train.txt
train_meta : data/*.meta.json
train_models : [dummy, gb, lr, mlp, rf]
predict : true
predict_models : data/*.pkl
predict_data : data/*.predict.txt
predict_meta : data/*.meta.json
outdir : results
Monitor the execution with Nextflow Tower using this URL: https://tower.nf/user/gordon-sissons/watch/5MUVbQjbhEFjRu
[da/4b6a7f] Submitted process > fetch_dataset (wdbc)
[c5/78cd5e] Submitted process > split_train_test (wdbc)
[68/29cb37] Submitted process > train (wdbc/dummy)
[01/0b89d1] Submitted process > train (wdbc/mlp)
[37/7e1cc8] Submitted process > train (wdbc/lr)
[22/8452c6] Submitted process > train (wdbc/rf)
[6e/763ce3] Submitted process > visualize (1)
[2a/3f3d84] Submitted process > train (wdbc/gb)
[5a/1c71a0] Submitted process > visualize (2)
[cc/7dd9ad] Submitted process > predict (wdbc/dummy)
[b5/f30153] Submitted process > predict (wdbc/mlp)
[de/340f09] Submitted process > predict (wdbc/lr)
[e5/d205ef] Submitted process > predict (wdbc/rf)
[38/6a6eaf] Submitted process > predict (wdbc/gb)
The best model for ‘wdbc’ was ‘mlp’, with accuracy = 0.991
Done!
Note that the pipeline trains each model, runs each model against the dataset, and determines the model with the best predictive accuracy.
If you’ve gotten this far: Congratulations! You’ve successfully set up your first compute environment in Seqera and used AWS Batch to execute a machine learning pipeline.
Conclusion
For organizations collaborating on large-scale data analysis and ML workloads, Seqera Platform running on AWS can be an excellent solution. You can more easily deploy powerful AWS compute and storage resources at scale, while reducing costs through optimized resource usage and managing spend across projects and teams. You can also leverage best-in-class curated bioinformatics pipelines and modules from nf-core and other sources.
This can help improve research productivity by giving you a unified view of data, pipelines, and results, and by making it easier to collaborate with local and remote teams.
We’re excited to hear how this works out for you, or about other challenges that AWS or Seqera can help you meet. Contact us at ask-hpc@amazon.com if you want to discuss these with us.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.