AWS HPC Blog
How BAM supercharged large scale research with AWS Batch
This post was contributed by Michael Juster (BAM), Yusong Wang (AWS), and Ala Abunijem (AWS)
Balyasny Asset Management (BAM) is a global investment firm managing more than $22 billion in assets. The firm faced a unique challenge: how could it empower its roughly 160 investment teams to conduct cutting-edge research across six strategies, when that research often requires highly parallelized workloads? The solution: a powerful combination of AWS Batch and Amazon Elastic Kubernetes Service (Amazon EKS).
In today’s post, we’ll dive into how BAM harnesses AWS Batch to supercharge their research capabilities, enabling teams to efficiently tackle highly parallelized workloads. We’ll explore how this cloud-based solution is pushing the boundaries of financial research and maintaining BAM’s competitive edge in all market conditions.
BAM’s system: design principles for success
BAM’s system is built on five core principles:
- Immediate access to resources: Ensure that no team experiences delays in accessing the resources needed to research an idea.
- Team-based isolation: Maintain strict isolation between teams to protect data and ensure security, and prevent any team from impacting the performance of another team's applications.
- Efficient resource utilization: Adopt a pay-as-you-go model, scaling resources up or down based on demand, and scale to zero when no jobs are running.
- Consistent job submission: Provide a standardized interface for job submission.
- Cost accountability: Ensure that the costs associated with each team's jobs are accurately tracked and attributed to the right team.
Kubernetes as the foundation: a robust research environment
At the heart of BAM's research efforts lies a robust and versatile Kubernetes system running on Amazon EKS. This cloud-native infrastructure serves as a foundational pillar, enabling the organization to push the boundaries of research and innovation.
One of the key advantages of BAM’s Kubernetes system is its production-ready nature, complete with comprehensive logging, metrics, and alerting capabilities. This level of observability and monitoring ensures that the research teams can quickly identify and address any issues that may arise, minimizing downtime and maximizing the efficiency of their workflows.
The system also provides access to a diverse array of storage options, allowing researchers to select the most appropriate solution for their specific needs. From the high-performance Amazon FSx for Lustre file system to the scalable Amazon Simple Storage Service (Amazon S3) and the flexibility of Network File System (NFS), the Kubernetes system empowers researchers to leverage the storage technologies that best suit their data-intensive workloads.
Recognizing the diverse computational requirements of modern research, BAM’s Kubernetes system is also equipped with specialized compute resources. This includes access to GPU-optimized instances for accelerating machine learning and data-intensive simulations, as well as CPU and memory-optimized instances to handle a wide range of computational tasks.
The system enforces isolation at the team level through Kubernetes namespaces, providing each team with:
- Per team access control: Each team manages its own namespace, ensuring they have full control and visibility within their boundaries, while preventing other teams from accessing or viewing their resources.
- Secrets management integration: Teams have exclusive access to their own secrets within their namespace, ensuring sensitive information remains secure.
- Secure Docker image access: With image pull secrets, teams can pull Docker images specific to their namespace, maintaining private, team-specific application management.
- Flexible storage access: Persistent Volume Claims (PVCs) offer secure access to BAM’s storage solutions. Teams can individually mount Common Internet File System (CIFS) and NFS shares or use FSx for Lustre for high performance.
Given these benefits, BAM required that any batch scheduler solution integrate with their existing Kubernetes system.
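The namespace controls described above can be sketched as a pod manifest built in code. This is a minimal illustration, not BAM's actual configuration: the team name, image registry, secret name, and claim name are all hypothetical placeholders.

```python
# Sketch of a team-scoped pod spec mirroring the isolation controls above.
# All names (team-alpha, registry.example.com, the secret and PVC names)
# are hypothetical placeholders, not BAM's actual configuration.

def team_pod_manifest(team: str, image: str) -> dict:
    """Build a pod manifest confined to a single team's namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{team}-research-job",
            "namespace": team,  # per-team access-control boundary
        },
        "spec": {
            # Pull team-private images with a namespace-scoped secret
            "imagePullSecrets": [{"name": f"{team}-registry-creds"}],
            "containers": [{
                "name": "research",
                "image": image,
                "volumeMounts": [{
                    "name": "research-data",
                    "mountPath": "/data",
                }],
            }],
            # A PVC grants access to CIFS, NFS, or FSx for Lustre storage
            "volumes": [{
                "name": "research-data",
                "persistentVolumeClaim": {"claimName": f"{team}-research-pvc"},
            }],
        },
    }

manifest = team_pod_manifest(
    "team-alpha", "registry.example.com/team-alpha/research:latest"
)
print(manifest["metadata"]["namespace"])  # team-alpha
```

Because every resource reference in the spec is scoped to the team's namespace, a pod submitted this way cannot see another team's secrets, images, or volumes.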
The scalability challenge
BAM’s existing scheduling options faced challenges accommodating the growing demands of workloads. Running more than 1,000 jobs concurrently per team was not possible, creating a bottleneck that reduced research throughput and limited the exploration of new ideas.
Enter AWS Batch: taking BAM's research infrastructure to the next level
AWS Batch was the game-changing batch system BAM needed. It enabled them to achieve each of the design principles set out at the start:
- Scalability: Enables the execution of tens of thousands of concurrent jobs, ensuring immediate resource availability and accelerating research.
- Seamless Kubernetes integration: Preserves the existing system's benefits without requiring a new solution design.
- Dynamic scaling and resource management: Allows teams to adjust resources based on their daily demands and pay only for what they use.
- Cost transparency: Maintains clear cost attribution to each team through workload tagging.
- Team isolation: Preserves namespace isolation, allowing teams to operate independently within their secure environments.
- Unified job submission: A unified client based on the AWS API streamlines job submissions across various tools.
Implementation results
By implementing AWS Batch, integrated with Amazon EKS, BAM has seen impressive results. One of the most notable outcomes has been a substantial increase in research throughput. By leveraging the scalability and parallelization capabilities of AWS Batch, BAM’s teams are now able to run tens of thousands of concurrent jobs, removing previous bottlenecks and enabling them to explore more ideas and hypotheses at a faster pace.
The pay-as-you-go model and dynamic scaling capabilities of the integrated AWS Batch and Amazon EKS solution have also led to significant cost transparency and optimization opportunities. Research teams can now more effectively manage their resource utilization, ensuring that they only consume the necessary compute power and storage, resulting in improved resource efficiency and cost savings.
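The pay-as-you-go behavior rests on Batch compute environments that scale down to zero when idle. A minimal sketch of such a configuration follows, using the kwargs shape of boto3's `batch.create_compute_environment`; the ARNs, subnets, instance profile, and vCPU ceiling are placeholders, not BAM's actual values.

```python
# Sketch of an EKS-backed AWS Batch compute environment that scales to zero.
# ARNs, subnets, instance profile, and limits are placeholders only.

def compute_environment_config(team: str, cluster_arn: str) -> dict:
    """Assemble kwargs for a boto3 batch.create_compute_environment call."""
    return {
        "computeEnvironmentName": f"{team}-ce",
        "type": "MANAGED",
        "eksConfiguration": {
            "eksClusterArn": cluster_arn,
            "kubernetesNamespace": team,  # jobs land in the team's namespace
        },
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,       # scale to zero when no jobs are running
            "maxvCpus": 10_000,  # burst capacity for parallel workloads
            "instanceTypes": ["optimal"],
            "instanceRole": "arn:aws:iam::111122223333:instance-profile/eks-instance-role",
            "subnets": ["subnet-placeholder"],
            "securityGroupIds": ["sg-placeholder"],
        },
    }

config = compute_environment_config(
    "team-alpha", "arn:aws:eks:us-east-1:111122223333:cluster/research"
)
print(config["computeResources"]["minvCpus"])  # 0
```

With `minvCpus` set to 0, Batch terminates all instances once the queue drains, so an idle team incurs no compute cost.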
Finally, BAM has simplified its operations: a standardized job submission process and seamless integration with existing tools have reduced complexity for research teams.
Conclusion
AWS Batch, integrated with Amazon EKS, has significantly enhanced BAM's high performance computing capabilities by providing unmatched scalability, cost-efficiency, and operational ease. This powerful combination empowers BAM to reach new heights of productivity and innovation, supporting diverse applications such as weather backfills, alpha research, and the generation of production-grade trading signals; some teams run more than 100,000 jobs per week. The integration of AWS Batch not only enhances BAM's system design but also paves the way for innovative high-volume job processing, ensuring BAM remains at the forefront of performance and efficiency.
For organizations aiming to scale HPC workloads and accelerate time-to-insight, AWS Batch offers a compelling and powerful solution worthy of consideration.