AWS HPC Blog
Securing HPC on AWS – isolated clusters
In this post, we’ll share two ways customers can operate HPC workloads using AWS ParallelCluster while completely isolated from the Internet. ParallelCluster supports many different network configurations to cover a range of use cases.
When we refer to isolation, we mean situations where your HPC cluster is completely self-contained inside AWS, or where you have a private connection into AWS (perhaps over a virtual private network or an AWS Direct Connect private VIF) and you want to keep both your on-premises environment and AWS partitioned from the Internet.
This advice applies to various industries – any customer requiring network isolation for running their HPC workloads on AWS should benefit from reading this post.
Background
HPC customers in regulated industries (like the US Department of Defense, for example) often have to attain an Authority to Operate (ATO) before they can run production workloads. An ATO effectively means the customer has done a comprehensive review of the security controls for the services they want to use, and accepted the risk of running those services.
While obtaining an ATO falls on the customer side of the shared responsibility model, AWS has numerous services that have gone through third-party assessments so customers can inherit compliance controls that help to accelerate the ATO process – including processes to implement STIG compliance with ParallelCluster. DoD customers who have achieved an ATO for their AWS environment may not be able to allow their workloads to access the Internet – in fact, allowing that may require an additional extensive review that can be time consuming.
Today we’ll show how you can run HPC workloads on AWS under these constraints, and how you can accomplish it in AWS regions of your choice, including those designed to host regulated workloads with stringent compliance requirements like AWS GovCloud. We’ll dive deeper into what’s needed to provision an isolated environment to solve for these cases using two AWS CloudFormation templates available in our GitHub repo, one with and one without Active Directory (AD) integration – so that you can quickly spin up isolated clusters in as little as 20 minutes.
High-level architecture
Figure 1 depicts a case where customers want their HPC clusters in AWS to be entirely self-contained – with no inbound or outbound Internet connectivity. This can arise when customers have a private network on-premises that they want to extend into their AWS environment, or when they require strict isolation for their AWS-based HPC workloads.
Now let’s take a look into what’s happening under the hood when operating an isolated cluster.
Communicating with AWS services in an isolated subnet
Within AWS you can create an Amazon Virtual Private Cloud (VPC) that allows you to operate a logically isolated virtual network. Within the VPC, you can create endpoints that let you establish connectivity between VPCs and AWS services without the need for an internet gateway, a NAT device, VPN, or firewall proxies. This means that infrastructure within the VPC can operate without reaching beyond the VPC or outside of AWS.
You can configure two types of endpoints: interface or gateway. The type of endpoint you create will be driven by the AWS service you’re trying to connect to. An interface endpoint is a collection of one or more elastic network interfaces with a private IP address that serve as entry points for traffic destined to a supported service. A gateway endpoint, on the other hand, targets specific IP routes in a VPC route table, in the form of a prefix-list, used for traffic destined to Amazon DynamoDB or Amazon Simple Storage Service (S3).
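As an illustration, here is a sketch of creating one endpoint of each type with the AWS CLI. The VPC, subnet, security group, and route table IDs are placeholders, and us-east-1 is used only as an example region; substitute your own values and the service you actually need.

# Interface endpoint for Systems Manager (placeholder IDs; adjust the region and service name)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ssm \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

# Gateway endpoint for Amazon S3, attached to a route table instead of subnets
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0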
Note that while many AWS services have VPC endpoints, some do not, such as Amazon Route 53 and AWS Identity and Access Management (IAM).
AWS ParallelCluster in an isolated subnet
To operate an isolated cluster, ParallelCluster requires a mix of both interface and gateway endpoints to establish connectivity to several AWS services depending on what you are trying to do.
HPC customers are probably familiar with using SSH to log in to their clusters to submit jobs. You may not be as familiar with AWS Systems Manager Session Manager (SSM), which allows you to log in to Amazon EC2 instances without having to open up SSH ports on network interfaces or manage key pairs.
SSM lets you log in to isolated EC2 instances from the AWS Management Console using your browser, which means you don’t have to set up a VPN or Direct Connect connection from your on-premises environment to log in to your cluster. To enable this, you need to configure the appropriate VPC endpoints. Once they’re deployed, users can simply authenticate to the AWS Management Console and use SSM to get a shell prompt on an EC2 instance.
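For example, once the SSM endpoints and the right instance profile are in place, a user with AWS credentials (and the Session Manager plugin for the AWS CLI installed) could open a shell like this; the instance ID is a placeholder.

# Start an interactive shell on an isolated instance via Session Manager
aws ssm start-session --target i-0123456789abcdef0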
SSM is only one feature that customers can use with ParallelCluster. While not exhaustive in describing every ParallelCluster use case, Figure 2 provides a list of the VPC endpoints required for common ParallelCluster functions such as building a cluster, building an image, and so on. The CloudFormation templates come with a subset of these endpoints, covering cluster builds, monitoring, login nodes, SSM login, and Directory Service integration.
Being clear about whether you need a VPC endpoint for your use case matters because interface endpoints carry a cost, both an hourly charge and a charge based on the volume of data processed. There’s no additional charge for using gateway endpoints.
Shared storage and VPC endpoints
In Figure 2, the boxes marked with an asterisk and highlighted in yellow fall into an “it depends” category. When you initiate a pcluster create-cluster command, the CloudFormation service interacts with FSx, EFS, and EBS using internal management paths. Once CloudFormation provisions those resources, they communicate with your cluster using DNS (or, in the case of EBS, over the EC2 VPC endpoint). This means you can create a cluster with those storage options and use them with your cluster without those endpoints provisioned.
The cases in which you would need the respective endpoints involve API calls made from instances in the isolated subnet. For example, running the aws fsx create-file-system command from a host in the isolated subnet without the FSx endpoint would fail because the host has no access to the FSx API. But that same host can access an already-attached FSx for Lustre file system using DNS or the file system’s IP address.
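To make that concrete, here is a sketch of mounting an already-provisioned FSx for Lustre file system from a node in the isolated subnet using its DNS name; the file system ID, region, and mount name are placeholders, and the Lustre client must already be installed (ParallelCluster handles this for storage it manages). No FSx API call, and therefore no FSx endpoint, is involved.

# Mount an existing FSx for Lustre file system by DNS name (placeholder file system ID and mount name)
sudo mkdir -p /fsx
sudo mount -t lustre -o relatime,flock \
  fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /fsx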
Another example: suppose you want to create EBS snapshots of the volume attached to your head node. You can do this using the Amazon EC2 VPC endpoint, but what if you wanted to compare two EBS volume snapshots to understand the differences between them? That requires connectivity to the EBS direct APIs, which means you would need the EBS VPC endpoint.
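For instance, comparing two snapshots of the same volume uses the EBS direct APIs, so from an isolated subnet this call only succeeds with the EBS interface endpoint in place; the snapshot IDs below are placeholders.

# List the blocks that differ between two snapshots (EBS direct API call)
aws ebs list-changed-blocks \
  --first-snapshot-id snap-0aaaaaaaaaaaaaaaa \
  --second-snapshot-id snap-0bbbbbbbbbbbbbbbb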
Configuration changes for isolated deployments
Since Amazon Route 53 does not have a VPC endpoint, we need to make some modifications to the ParallelCluster configuration file. When creating a Slurm cluster, ParallelCluster creates a private Route 53 hosted zone that it uses to resolve the custom compute-node hostnames. For isolated deployments, ParallelCluster must instead be configured to use the default EC2 hostnames, which we can achieve by applying the following changes to its config file:
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
Customers trying to launch clusters using pcluster create-cluster need to pay attention to two small but vital details when it comes to syntax.
First, ParallelCluster will attempt to reach the https://iam.amazonaws.com/ endpoint as part of the initial validation of the cluster. Like Route 53, IAM lacks a VPC endpoint, but we can bypass this check using --suppress-validators. Knowing this, the full command syntax to launch a cluster would be:
pcluster create-cluster --cluster-name {name} --cluster-configuration {file-name.yml} --suppress-validators type:AdditionalIamPolicyValidator
The AWS Security Token Service (STS) endpoint is also required, and this service does have a VPC endpoint. Configuring the endpoint, however, will not allow you to launch a cluster until you specify the regional endpoint to send STS calls to. This means you need to run export AWS_STS_REGIONAL_ENDPOINTS=regional prior to launching a cluster.
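The STS interface endpoint is created the same way as any other interface endpoint. Here is a sketch with placeholder IDs and an example region; after this you would set the environment variable and run the pcluster create-cluster command shown above.

# Interface endpoint for STS (placeholder IDs; substitute your region)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.sts \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

# Route STS calls to the regional endpoint before launching the cluster
export AWS_STS_REGIONAL_ENDPOINTS=regional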
Isolated clusters with and without AD
The two solutions for configuring isolated clusters cater to different user management needs. The main distinction lies in whether you want to integrate your cluster with an Active Directory or not. The two solutions are self-contained and can be deployed individually, as there is no dependency or requirement to implement them together.
Before getting into the differences between the solutions, let’s discuss how we can install ParallelCluster in an isolated subnet. The ParallelCluster CLI is typically installed from PyPI (the Python Package Index), which requires Internet connectivity to download the necessary packages and dependencies. Without Internet access, we instead need the standalone installer, a self-contained package that includes all the required dependencies for ParallelCluster.
Both solutions follow the same process to install ParallelCluster once the installer is pre-loaded into an S3 bucket. We provision an EC2 instance as a ParallelCluster admin node; it runs a user-data script that pulls the standalone installer down from Amazon S3, installs ParallelCluster, creates a ParallelCluster configuration file, and finally launches a small cluster.
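As a rough illustration of that flow (not the exact script from the templates), the user-data might look something like the sketch below. The bucket name, bundle file name, and install script name are assumptions; adjust them to match the actual standalone installer artifacts you stage in S3.

#!/bin/bash
# Hypothetical user-data sketch: fetch the pre-staged standalone installer from S3 and install it
set -euo pipefail

BUCKET=my-isolated-hpc-bucket                  # assumed bucket name
BUNDLE=aws-parallelcluster-installer.zip       # assumed bundle file name

aws s3 cp "s3://${BUCKET}/${BUNDLE}" /tmp/
cd /tmp && unzip -q "${BUNDLE}"
bash install_pcluster.sh                       # assumed installer script name; check the bundle contents

# Confirm the CLI is available before writing a config and creating a cluster
pcluster version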
Note that while the ParallelCluster standalone installer allows you to deploy and manage clusters in an isolated environment, it does not provide API-level control over the clusters. We designed the standalone installer for administrative tasks, like creating, updating, or deleting clusters, but it does not offer programmatic access or integration with other AWS services.
Both solutions also take advantage of login nodes, which give users a place to access the cluster and run jobs as an alternative to connecting directly to the cluster’s head node.
Where these solutions start to differ is how end-user authentication is performed. The AD integration solution allows you to centralize user account management and access control for your cluster, eliminating the need for SSH key pairs. Users can authenticate with their AD credentials to access the head node or login nodes to launch jobs. Alternatively, if you prefer not to manage an AD solution or wish to minimize costs associated with running additional services, you can opt for the isolated cluster solution without AD integration that relies on SSH key pairs or SSM for authentication.
Shifting focus to AD integration, the architecture incorporates two subnets, which is a requirement for AWS Managed Microsoft AD (Directory Service). Another EC2 instance is launched to manage AD; it pulls down all of the required tools from our S3 bucket and installs them. One Network Load Balancer (NLB) is deployed to distribute traffic to AD, and another for the login nodes. The cluster also uses LDAPS (LDAP over TLS/SSL) with certificate verification, which ensures secure transmission of potentially sensitive information.
We spoke about Route 53 not having VPC endpoints and how we had to modify the ParallelCluster configuration file. In the case of the cluster with AD integration, we need our hosts to be able to resolve the domain name, corp.pcluster.com, and internal EC2 hostnames like ip-1-2-3-4.(region).compute.internal. We can accomplish both through the Directory Service DNS, which we can configure as a DHCP option set on the VPC so that all hosts obtain the same domain name and DNS IP settings. We can verify this by typing cat /etc/resolv.conf on any of the provisioned nodes. For the cluster without AD integration, we simply use default EC2 hostname resolution.
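Here is a sketch of that DHCP option set configuration with the AWS CLI; the DNS server IPs stand in for the resolvers provided by Directory Service, and the option set and VPC IDs are placeholders.

# Create a DHCP option set pointing at the Directory Service DNS servers (placeholder IPs)
aws ec2 create-dhcp-options \
  --dhcp-configurations \
    "Key=domain-name,Values=corp.pcluster.com" \
    "Key=domain-name-servers,Values=10.0.0.10,10.0.1.10"

# Associate it with the VPC so every host picks up the same domain and resolver settings
aws ec2 associate-dhcp-options \
  --dhcp-options-id dopt-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0

# Verify on any provisioned node
cat /etc/resolv.conf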
What’s next?
Once you’ve finished deploying an isolated cluster, you can experiment with launching a customized version yourself.
The simple ParallelCluster configuration file is located in the /usr/bin/pcluster directory on the ParallelCluster admin node, and sample cluster configurations are located in the same GitHub repository. You can create a new configuration file and specify additional parameters related to the head node, scheduling, shared storage, and more to get comfortable with ParallelCluster’s operation in an isolated environment.
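For example, a minimal custom configuration for an isolated Slurm cluster might look like the sketch below; the subnet ID, key pair name, OS, and instance types are placeholders, and the cluster is launched with the suppress-validators flag shown earlier.

# Write a minimal isolated-cluster config (placeholder values), then create the cluster
cat > isolated-cluster.yaml <<'EOF'
Image:
  Os: alinux2
HeadNode:
  InstanceType: t3.medium
  Networking:
    SubnetId: subnet-0123456789abcdef0
  Ssh:
    KeyName: my-keypair
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      DisableManagedDns: true
      UseEc2Hostnames: true
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5
          InstanceType: c5.large
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
EOF

pcluster create-cluster \
  --cluster-name isolated-demo \
  --cluster-configuration isolated-cluster.yaml \
  --suppress-validators type:AdditionalIamPolicyValidator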
You can use the same methods we used in the prerequisites section to get application-specific software into Amazon S3 and onto your cluster. The AWS CLI can copy files from S3 over the existing gateway endpoint.
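For instance, something like the following (the bucket and archive names are hypothetical) pulls an application tarball onto the cluster’s shared storage over the S3 gateway endpoint:

# Copy application software from S3 to shared storage, then unpack it
aws s3 cp s3://my-isolated-hpc-bucket/my-app.tar.gz /shared/
tar -xzf /shared/my-app.tar.gz -C /shared/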
Conclusion
Operating isolated HPC clusters in AWS using ParallelCluster addresses the diverse needs of customers across various industries, including regulated sectors with stringent compliance requirements.
This post demonstrated how customers can deploy self-contained clusters within AWS, eliminating the need for on-premises network connectivity or Internet access.
For scenarios where centralized user management is preferred, we explored integrating AWS Managed Microsoft Active Directory with the HPC cluster, enabling secure authentication via AD credentials. Alternatively, for customers seeking a more lightweight solution, we showcased deploying isolated clusters without AD integration, leveraging SSH key pairs or AWS Systems Manager for authentication.
Regardless of the approach you choose, the solutions we provide here are scalable and flexible ways to run HPC workloads in AWS while maintaining complete isolation.