AWS HPC Blog
Improve the speed and cost of HPC deployment with Mountpoint for Amazon S3
HPC workloads like genome sequencing and protein folding involve processing huge amounts of input data. Genome sequencing aims to determine an organism’s complete DNA sequence by analyzing extensive genome databases containing gene and genome reference sequences from thousands of species. Protein folding uses molecular dynamics simulations to model the physical movements of atoms and molecules in a protein.
These workloads require analyzing massive input datasets. To support applications like these, which need high bandwidth, low latency, and parallel access to lots of data, AWS offers managed storage services like Amazon FSx for Lustre and Amazon EFS. These services provide file systems optimized for compute-intensive workloads and eliminate the need for customers to manage the underlying storage infrastructure.
When selecting storage for your machine learning training data, Mountpoint for Amazon S3 can provide a good alternative to support these types of workloads. Mountpoint for Amazon S3 is an open-source file client that you can use to mount an S3 bucket on your compute instances, accessing it as a local file system. It translates local file system API calls to REST API calls on S3 objects, and is optimized for high-throughput performance. It builds on the AWS Common Runtime (CRT) library, which is purpose-built for high-performance and low-resource usage to make efficient use of your fleet.
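To get a feel for how it works, here's a minimal sketch of mounting a bucket by hand on a single instance; the bucket name and directory are placeholders:
mkdir -p /mnt/s3data                         ## the mount target must already exist
mount-s3 amzn-s3-demo-bucket /mnt/s3data     ## objects now appear as files under /mnt/s3data
ls /mnt/s3data
umount /mnt/s3data                           ## unmount when finished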
In this post, we’ll deploy Mountpoint for Amazon S3 in an AWS ParallelCluster using the Community Recipe Library for HPC Infrastructure on AWS. We’ll then run the IOR parallel I/O benchmark tool to compare the performance of Mountpoint for Amazon S3 across the cluster, testing access speeds for reading files of varying sizes stored in Amazon S3.
Setting expectations
As shown in Figure 1, we should expect parallel performance to scale well as we increase the number of nodes in the cluster accessing the shared Amazon S3 storage at the same time. We hope this shows you how to achieve high-throughput access to virtually limitless Amazon S3 data using a simple approach with ParallelCluster.
What do you need to try this out?
- AWS ParallelCluster command line interface (CLI) with Node.js installed (see the example after this list). See Installing AWS ParallelCluster in a non-virtual environment using pip.
- An existing S3 bucket for the performance test.
- An existing security group, subnet, and EC2 key pair.
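A minimal sketch of installing the CLI with pip, assuming Python 3 and Node.js are already available on your machine:
## install the ParallelCluster CLI into the current Python environment
python3 -m pip install --upgrade aws-parallelcluster
node --version        ## the CLI needs Node.js for AWS CDK
pcluster version      ## confirm the install worked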
Steps to deployment with ParallelCluster
- For our deployment, download the ParallelCluster config file here. This config file uses the HPC GitHub sample recipe for Mountpoint for Amazon S3 to deploy the mount point to the cluster nodes.
- Once you've downloaded the full ParallelCluster config file, update it with your own values:
DEMO-BUCKET-NAME – an existing bucket for testing. Create one if needed.
HOST-FILESYSTEM-PATH – the path the bucket will be mounted to inside the nodes (e.g. /testpath).
Your Subnet ID – a subnet ID for deployment (e.g. subnet-xxxxxxxx). You can find the subnet ID in the Subnets console.
Your Security Group – the ID of the security group (e.g. sg-xxxxxxxx). You can find the security group ID in the Security Groups console. Make sure the security group belongs to the same VPC as the subnet.
Your ed25519 key – the name of the EC2 key pair. You can find the name of your key in the EC2 Key Pairs console. If you don't have one, create a new one.
- Now, you can use this command to deploy a cluster:
pcluster create-cluster -c <template file> -r <region> -n <cluster_name>
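For example, with the placeholders filled in (the cluster name s3mp-demo and the region here are just illustrations), the deployment and a follow-up status check could look like this:
pcluster create-cluster -c cluster-config.yaml -r us-east-1 -n s3mp-demo
## poll until clusterStatus reaches CREATE_COMPLETE
pcluster describe-cluster -n s3mp-demo -r us-east-1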
Deep-dive into the configuration
Looking at the config file more closely, it refers to three post-install scripts designed to work with ParallelCluster custom bootstrap actions. The first two scripts come from the HPC Recipes for AWS, while the last script was created specifically for this blog. These scripts run on the cluster head node and on all the compute nodes where the S3 bucket will be mounted:
- install.sh – installs Mountpoint for Amazon S3 and prepares the mount point directory
- mount.sh – configures a systemd service that uses Mountpoint for Amazon S3 to mount a bucket to a directory
- s3-mp-install-ior.sh – installs the IOR I/O performance benchmark suite and its necessary dependencies on the designated nodes. Since the performance testing runs on the compute nodes, this script is optional for the head node.
IOR is a parallel input/output benchmark that can test the performance of parallel storage systems using different interfaces and access patterns. The installation log, located at /var/log/ior-install.log, provides debugging information about the installation process.
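If you want to confirm the bootstrap actions succeeded on a node, here are a couple of illustrative checks; the grep pattern is an assumption about how the FUSE mount appears in the mount table:
tail /var/log/ior-install.log        ## IOR install log written by s3-mp-install-ior.sh
mount | grep -i mount-s3             ## the bucket should show up as a FUSE mount
df -h <<HOST-FILESYSTEM-PATH>>       ## the mount point should be listed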
Cluster head node
Here's an example of the Head Node section of the ParallelCluster configuration. The bucket DEMO-BUCKET-NAME is mounted to the location HOST-FILESYSTEM-PATH on the host. HOST-FILESYSTEM-PATH is an absolute path like /s3mountpoint. You can mount multiple buckets on a single host.
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Sequence:
        - Script: https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/storage/mountpoint_s3/assets/install.sh
        - Script: https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/storage/mountpoint_s3/assets/mount.sh
          Args:
            - <<DEMO-BUCKET-NAME>>
            - <<HOST-FILESYSTEM-PATH>>
            - '--allow-delete --allow-root'
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    S3Access:
      - BucketName: <<DEMO-BUCKET-NAME>>
        EnableWriteAccess: true
We'll be using --allow-delete --allow-root to enable read/write access to the bucket because the performance testing tool requires it. If you're using this mount point for a read-only scenario like storing training data, we recommend using --read-only to prevent accidentally overwriting the training data set. For more options to configure the mount point, see the Mountpoint configuration documentation.
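If you mount the bucket by hand with the client (as in the earlier sketch), the equivalent flag looks like this; the bucket name and path remain placeholders:
mount-s3 --read-only amzn-s3-demo-bucket /mnt/training-data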
To access the S3 bucket from the head node, you need to enable S3 access either through the S3Access configuration under HeadNode/Iam/S3Access or by attaching an additional IAM policy under HeadNode/Iam/AdditionalIamPolicies. You can use a managed policy such as AmazonS3ReadOnlyAccess to provide generic read-only access, or you can create a custom policy with more specific permissions tailored to your use case. The key requirement is that the EC2 instance role for the head node must have permission to access the S3 bucket so it can read data from or write data to that location. The same applies to the compute node section.
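As a sketch of the custom-policy route, you could create a policy scoped to the test bucket and list its ARN under AdditionalIamPolicies; the policy name s3-mountpoint-access and the file name here are just examples, and DEMO-BUCKET-NAME is your bucket:
cat > s3-mountpoint-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::DEMO-BUCKET-NAME" },
    { "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:AbortMultipartUpload"],
      "Resource": "arn:aws:s3:::DEMO-BUCKET-NAME/*" }
  ]
}
EOF
## create the policy and note the returned ARN for AdditionalIamPolicies
aws iam create-policy --policy-name s3-mountpoint-access \
  --policy-document file://s3-mountpoint-policy.json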
Compute nodes
Next, let’s look at an example of mounting the same bucket DEMO-BUCKET-NAME
to a local path of HOST-FILESYSTEM-PATH
in the Compute Nodes section of the ParallelCluster config file.
Scheduling:
  SlurmQueues:
    - Name: demo
      CustomActions:
        OnNodeConfigured:
          Sequence:
            - Script: https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/storage/mountpoint_s3/assets/install.sh
            - Script: https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/storage/mountpoint_s3/assets/mount.sh
              Args:
                - <<DEMO-BUCKET-NAME>>
                - <<HOST-FILESYSTEM-PATH>>
                - '--allow-delete --allow-root'
            - Script: https://raw.githubusercontent.com/aws-samples/aws-hpc-s3mountpoint/main/s3-mp-install-ior.sh
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
        S3Access:
          - BucketName: <<DEMO-BUCKET-NAME>>
            EnableWriteAccess: true
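Once the cluster is up, you can sanity-check the mount on the compute nodes from the head node. This is an illustrative check that assumes the demo queue maps to a Slurm partition named demo:
## allocate two compute nodes and confirm the bucket is mounted on each
srun -p demo -N 2 --ntasks-per-node=1 df -h <<HOST-FILESYSTEM-PATH>>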
Steps to run the performance testing
To execute the IOR benchmark across compute nodes, we'll use sbatch to submit a job to the HPC cluster. We'll connect to the head node using Session Manager and author an sbatch script that calls mpirun to launch the IOR executable with our desired parameters, as shown below. The sbatch script specifies the number of tasks, and Slurm will manage the number of nodes required.
## create the sbatch submission script
cd ~
cat > ior_submission.sbatch << EOF
#!/bin/bash
#SBATCH --job-name=ior-perf-test
#SBATCH --output=%N_%x_%j_%t.out
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks=8
module load intelmpi
mpirun bash -c "
cd <<HOST-FILESYSTEM-PATH>>
ior -r -w -v -F -o=S@S@S -b=2000m -i=1 -t=50m -a=POSIX --posix.odirect"
EOF
## submit the script
cd ~
sbatch ior_submission.sbatch
Slurm will spawn the MPI processes across the cluster worker nodes to perform the parallel I/O tests. The output file will be available on the first compute node.
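To follow the run, you can use standard Slurm commands; <jobid> is a placeholder for the ID printed by sbatch:
squeue                                                   ## watch the job while it's queued or running
sacct -j <jobid> --format=JobID,JobName,State,Elapsed    ## summary once it completes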
Performance results
Using this setup, we can use IOR to run some load tests across all our cluster compute nodes. It's worth noting that the actual mounting operation itself takes just a few seconds.
In our test runs, we used a single Amazon S3 bucket and attained exceptional read and write speeds – 4.23 GB/s (4.038 GiB/s) write throughput and 4.85 GB/s (4.622 GiB/s) read throughput per node (using c6i.16xlarge). This high read performance makes Mountpoint extremely well-suited for supporting the intensive read operations inherent in HPC workloads. Here’s some sample output from our own tests running on c6i.16xlarge instances:
IOR's test is designed to scale and maintain consistent throughput per server as we add more compute nodes. You also have the option to modify performance test settings like blockSize (-b) and transferSize (-t) to see how the throughput changes when you adjust those values.
For our performance runs, we used the options listed in Table 1. As you prepare your dataset for testing, take time to optimize the IOR settings and tailor them to your specific data characteristics for improved accuracy. For instance, adjust the blockSize downward if you're holding many smaller files, or remove the -F option if processes commonly read the same files. Taking these steps allows IOR to better simulate your real-world environment. For additional options, refer to the IOR official documentation.
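As an example, here's a variant of the ior line from the sbatch script above with smaller block and transfer sizes and three iterations; the values are purely illustrative:
mpirun bash -c "
cd <<HOST-FILESYSTEM-PATH>>
ior -r -w -v -F -o=S@S@S -b=256m -i=3 -t=4m -a=POSIX --posix.odirect"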
Conclusion
With data stored in Amazon S3 behind the scenes, Mountpoint for Amazon S3 delivers the durability, scalability, and high throughput needed to support demanding machine learning training workloads. It also provides a cost-efficient storage solution.
Mountpoint for Amazon S3 is easy to integrate with AWS ParallelCluster, and this extends to AWS Batch and self-managed EC2 instances, too – all popular methods for distributed training workflows. The quick, scalable, and economical nature of Amazon S3 behind Mountpoint removes traditional data storage challenges customers often face when pursuing ML training.
You can get started with Mountpoint for Amazon S3 today by building on the sample we've provided in the HPC Recipe Library. We encourage you to closely examine how the install and configure scripts operate, and to explore the other recipes in the library. These recipes are designed for cross-platform compatibility and robustness when working with ParallelCluster, and require little to no modification to use.