
How to improve reservoir simulation throughput using P5 instances

Figure 1 – A gridded reservoir model

Oil and gas companies run reservoir simulations to better understand how hydrocarbons behave underground. These simulations provide insights into fluid movement, pressure changes, and reservoir characteristics, helping companies make informed decisions about well placement. By simulating various scenarios, companies can assess risks, evaluate economic feasibility, and improve reservoir management practices, ultimately enhancing operational efficiency and profitability in oil and gas extraction.

In the current energy climate, characterized by economic uncertainty, environmental concerns, and technological advancements, there is an increasing need to run more and more reservoir simulation scenarios to quantify uncertainty and reduce risk. To meet this increasing demand, reservoir simulators are turning to GPUs from Amazon Elastic Compute Cloud (Amazon EC2) to provide the horsepower they need. In 2021, SLB, the world’s largest oilfield services company, released the first version of its INTERSECT reservoir simulator running on GPUs.

In this blog post, we’ll show how we achieved a 10.5x throughput improvement with SLB INTERSECT by using the GPU-based P5 instance family in place of a traditional CPU-based approach on Hpc7a instances. We’ll focus on simulation throughput (simulations/hour) as our metric rather than the performance of an individual simulation, because this better captures the potential of high compute density instances like the P5.

SLB INTERSECT

The reservoir simulator used in this blog post is the 2023.3 release of INTERSECT from SLB. This version of INTERSECT supports running simulations in both GPU and CPU modes. Typically, reservoir simulators use MPI or, in the case of INTERSECT, a hybrid MPI/OpenMP approach to run a single simulation across multiple nodes.

With the GPU version of INTERSECT, things are different: a simulation runs on a single GPU and therefore on a single node. Each GPU on the P5 instance offers 80 GB of memory, which is more than enough to handle the typical reservoir model sizes used by the industry (a few hundred thousand to a few million cells).

GPU simulations still require CPU processing, and INTERSECT lets you control the number of CPU threads that it uses. To achieve the best performance, it’s important that those threads are properly pinned to cores that are close to the GPU.

Testing methodology and results

INTERSECT provides a standard set of benchmarks which are representative of typical usage and problem size. The benchmark we selected for this evaluation is the GPU_Waterflood, which is a 2.3 million cell model.

We first ran the simulation on a single Hpc7a instance in various configurations to determine the best absolute throughput and throughput/$ that we could achieve using that instance type. We’ve shown the results in Table 1.

Table 1: Results of running INTERSECT on an Hpc7a instance in various configurations. The best throughput and throughput/$ were achieved by running 2 simulations simultaneously with 96 cores per simulation. Cost is based on the EC2 On-Demand price for those instances. ** Wall Clock Time

From the first three rows of Table 1 we can tell that, for this particular model, the speed-up as we increased the core count was not linear. This indicated that it might be possible to increase the simulation throughput by running more than one simulation in parallel on the same node. The last row shows the best throughput we achieved: running two simulations simultaneously on a single node.

We then ran the simulation on a single P5 instance in various configurations to again determine the best absolute throughput and throughput/$ that we could achieve, which we’ve summarized in Table 2.

Table 2: Results of running INTERSECT on a P5 (GPU-enabled) instance in various configurations. The best throughput was achieved by running 8 simulations simultaneously with one GPU and 4 CPU threads per simulation.

Figure 2 brings these analyses together and compares the best results we obtained for each instance. The P5 comes out on top when we look at absolute throughput; however, when normalizing by cost, the CPU-based Hpc7a fared better.

Figure 2: Throughput (simulations/hour) and throughput/$ (simulations/hour/$) comparison between hpc7a.96xlarge and p5.48xlarge. The P5 shows a 10.5x throughput improvement, whereas the hpc7a.96xlarge shows a 28% throughput/$ improvement.

Deep dive on execution parameters

To get the best performance when running INTERSECT, it’s important that we properly pin MPI ranks to CPU cores. In addition, when running on a GPU, it’s important that the selected GPU and CPU cores are close to each other from a bus-topology point of view.

INTERSECT runs using a utility called eclrun. This can submit jobs to a workload manager like LSF or PBS, in which case CPU pinning and GPU affinity are handled by the workload manager. eclrun can also run the simulation directly, in which case you have to indicate how to pin the processes yourself. We use Intel MPI to run INTERSECT, so the pinning is controlled by the MPI environment variables I_MPI_PIN_DOMAIN and I_MPI_PIN_ORDER, while the selection of the GPU is controlled via the CUDA_VISIBLE_DEVICES environment variable.
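
For a direct run, this just means exporting the relevant variables before invoking eclrun. Here is a minimal sketch, assuming a GPU run of the GPU_Waterflood case; the case file name and the eclrun arguments are illustrative and may differ in your installation.

```bash
# Make a single GPU visible to this simulation (a GPU run uses one MPI rank).
export CUDA_VISIBLE_DEVICES=0

# CPU pinning is handled by Intel MPI; the values we used for each
# instance type are discussed in the sections below.
export I_MPI_PIN_DOMAIN=4:compact
export I_MPI_PIN_ORDER=scatter

# Launch INTERSECT directly through eclrun (illustrative invocation).
eclrun ix GPU_Waterflood.afi
```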

Running INTERSECT on hpc7a.96xlarge

To determine the best way to pin the MPI ranks, let’s look at the CPU topology of the hpc7a.96xlarge using lscpu.
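
The two views we use are the standard summary and the extended per-CPU table:

```bash
# Summary of sockets, cores per socket, and cache sizes.
lscpu

# Extended per-CPU table; the cache column shows which L3 slice each core shares.
lscpu -e
```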

These two views show that the Hpc7a has two sockets, each with 96 cores. Cores are organized in 24 groups of 8 cores that share an L3 cache (the fifth column of the lscpu -e output encodes this).

In our test we found that using 4 threads per MPI rank and pinning each rank to 4 cores that share the same L3 cache gave the best result.

For runs that involved a single simulation per node, we set I_MPI_PIN_DOMAIN=4:compact and I_MPI_PIN_ORDER=scatter. These settings meant that the cores were organized in non-overlapping domains of size 4, and that MPI ranks were distributed across those domains to minimize the sharing of common resources (such as L3 caches) whenever possible.
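
As a quick sketch, these settings are exported before launching the run; Intel MPI can also print the resulting rank-to-core mapping so the pinning can be verified in the job log.

```bash
# Single-simulation pinning on the Hpc7a: non-overlapping 4-core domains,
# with ranks scattered across them.
export I_MPI_PIN_DOMAIN=4:compact
export I_MPI_PIN_ORDER=scatter

# Optional: at debug level 4 or higher, Intel MPI prints the process pinning
# at startup, which makes it easy to check where each rank landed.
export I_MPI_DEBUG=4
```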

For runs that involved 2 simulations per node, we used the notation I_MPI_PIN_DOMAIN=[F,F0,F00,…, F00000000000000000000000] to define, for each rank, a mask that controls which group of 4 cores is assigned to it.
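
Each hexadecimal mask in that list selects one group of 4 cores: F is cores 0 to 3, F0 is cores 4 to 7, and every additional trailing zero shifts the group up by four cores. The following helper is our own sketch (not part of INTERSECT or Intel MPI) of how the mask list for each of the two concurrent 24-rank simulations could be generated, so that the first simulation occupies cores 0 to 95 and the second occupies cores 96 to 191.

```bash
# Sketch: build the per-rank hex-mask list for I_MPI_PIN_DOMAIN.
# Ranks occupy consecutive 4-core groups, starting at group $first_group.
build_masks() {
  local first_group=$1 ranks=$2
  local zeros="" masks="" i
  for ((i = 0; i < first_group; i++)); do zeros+="0"; done
  for ((i = 0; i < ranks; i++)); do
    masks+="${masks:+,}F${zeros}"
    zeros+="0"   # the next rank's 4-core group sits 4 bits higher
  done
  echo "[$masks]"
}

# First simulation: 24 ranks on cores 0-95 (the first socket).
export I_MPI_PIN_DOMAIN=$(build_masks 0 24)
# Second simulation, in its own environment: 24 ranks on cores 96-191.
# export I_MPI_PIN_DOMAIN=$(build_masks 24 24)
```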

Running INTERSECT on P5

Let’s now look at the CPU topology on the P5 system, using the same lscpu views.

The P5 has 2 sockets of 48 cores each. Cores are organized in 12 groups of 8 cores that share the same L3 cache.

On a P5 instance, there is an affinity between GPUs and CPUs. GPUs 0 to 3 have an affinity with cores 0 to 47 and GPUs 4 to 7 have an affinity with cores 48 to 95.
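
This affinity can be checked directly on the instance with nvidia-smi, whose topology matrix includes a CPU affinity column for each GPU:

```bash
# Print the GPU/CPU topology matrix; the "CPU Affinity" column shows
# which core ranges each GPU should be paired with.
nvidia-smi topo -m
```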

When running on a GPU, there’s a single MPI rank per simulation, and it uses 4 threads. We ran 4 simulations per socket and pinned the rank of each simulation to 4 cores that share the same L3 cache. We also made sure that we assigned each simulation a GPU that had affinity with the cores it used.
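
Putting it together, here is a hedged sketch of how the 8 concurrent GPU simulations could be launched. The eclrun invocation, the per-run case names, and the use of OMP_NUM_THREADS to set the thread count are our assumptions; the core selection simply follows the GPU/CPU affinity described above.

```bash
# Sketch: 8 GPU simulations in parallel, one GPU and 4 CPU threads each.
# GPUs 0-3 pair with cores 0-47 (socket 0), GPUs 4-7 with cores 48-95 (socket 1).
for gpu in 0 1 2 3 4 5 6 7; do
  (
    export CUDA_VISIBLE_DEVICES=$gpu

    # First core of this simulation's 4-core group, chosen inside an 8-core
    # L3 group on the socket local to its GPU.
    if (( gpu < 4 )); then
      start=$(( gpu * 8 ))              # socket 0: cores 0-3, 8-11, 16-19, 24-27
    else
      start=$(( 48 + (gpu - 4) * 8 ))   # socket 1: cores 48-51, 56-59, 64-67, 72-75
    fi

    # Hex mask "F" shifted up by $start bits (one trailing zero per 4 cores).
    zeros=""
    for ((i = 0; i < start / 4; i++)); do zeros+="0"; done
    export I_MPI_PIN_DOMAIN="[F${zeros}]"

    export OMP_NUM_THREADS=4            # assumption: thread count set via OpenMP

    eclrun ix GPU_Waterflood_${gpu}.afi # illustrative per-run case name
  ) &
done
wait
```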

Conclusion

In this post we’ve shown how to achieve a 10.5x improvement in simulation throughput by running 8 GPU simulations simultaneously on a P5 instance from Amazon EC2, compared to running 2 CPU-based simulations simultaneously on an Hpc7a instance.

This throughput improvement comes at a 28% decrease in throughput per dollar. We believe that simulation throughput is an important metric to look at when benchmarking reservoir simulators, as it better captures the advantages of high compute density instances such as the Amazon EC2 P5.

We would like to thank SLB for providing access to INTERSECT to run this benchmark.