MPI and CUDA Fortran

I currently use Fortran and MPI to solve an incompressible flow problem using finite differences and domain decomposition. I am solving the problem on a supermicro workstation using 2 Intel E5-2683V4 2.1GHz 16C cpus. The code uses double precision. On the host, I currently use 64 MPI ranks with subdomains that are 128^3 grid points. I want to simulate problems with subdomains that are 256^3 grid points, but the cpus alone are too slow. The main bottleneck is the multigrid solver. I use a Jacobi tridiagonal smoother. The LU decomposition, smoothing, prolongation, and restriction could be performed on gpus. Other portions of the CFD code, use finite difference methods that could also be accelerated by using gpus. I am considering getting either two Quadro GV100 or two Quadro GP100 to accelerate the solver. I would also consider getting two Tesla K40 gpus with active cooling, but I am concerned that they are being phased out. I am new to CUDA Fortran. At 256^3 grid points, the subdomains are not that big. If I get NVIDIA gpus, I anticipate using MPS to fully utilize the gpus by distributing the 64 MPI ranks over the 2 gpus. The Quadro gpus will easily double the cost of my workstation. However, without the gpus, I can only simulate problems with subdomains that are 128^3 grid points. If I can simulate problems with subdomains that are 256^3 grid points by making the code 4 to 8 times faster with gpus, then the cost would be worth it. Can I expect that sort of performance gain for this sort of CFD problem? Thank you, Doug.

I run similar 3D problem sizes (for a finite difference code) and have myself just recently extended a CUDA (c++) code to MPI+Cuda as you describe, running on P100s. I would first recommend restricting your MPI ranks to match the number of GPUs you run on. There is overhead to launching kernels that I have found becomes fairly significant at 128^3 per GPU (but is much more negligible at 256^3), and would imagine you’d have no trouble fitting a 256^3 domain (or larger) on one GPU. (Not to mention overhead from halo-sharing.)

So you could probably get away with dropping MPI completely, although I would recommend retaining it to enable scalability to 2+ GPUs (especially since you already have that infrastructure in place). As another note, I implemented a device kernel to perform the non-contiguous halo sharing (in a pencil decomposition), rather than performing many cudaMemCpys to the buffer (which MPI would send/receive). I further implemented a switch to skip the MPI send/receive when using a 1x1 (or 2x1, or 1x2, as appropriate) decomposition, which was key to matching the speed of my former single-GPU, MPI-less program (which does no halo sharing). These were just things I figured would be optimal, so if anyone has better suggestions I’d love to hear them.

You could probably expect far better than a 4-8x speed-up - I see roughly between 30x and 100x speedups on one GPU (vs. 16 cores with OpenMP).

Thank you for your feedback. For my application, each mpi rank currently has 128^3 grid points. For 4x4x4=64 mpi ranks in 3D, the total size of my domain is 512^3 grid points. I want to be able to simulate a three-dimensional domain with 1024^3 grid points using a combination of cpus and gpus on my workstation. For this size problem, each mpi rank would have 256^3 grid points plus some ghost cells. I think that this is the halo sharing that your mention above. Multigrid is used to solve a Poisson equation. If the gpus significantly improve performance, I could solve the 1024^3 problem on my workstation, and it would be a great step toward a code that could run on supercomputer. The memory that is available on the gpu could limit how much data is loaded from the cpus.

Oops - sorry for misreading. I’m not sure how well it will work if you plan to have more data being processed by each GPU than there is RAM per GPU. If you have enough work to do on each subdomain before needing to share information, you could asynchronously process one domain while loading the next onto a GPU. Otherwise your program will become dominated by the time it takes to swap out the memory of each domain before processing. In other words, if you have a fixed number of GPUs it’s often best to keep your problem size small enough to fit onto the total DRAM of all GPUs at once (and have a 1 to 1 mapping of MPI ranks onto GPUs). How many 1024^3 arrays does your problem require?

For multigrid - you might be interested in https://devblogs.nvidia.com/high-performance-geometric-multi-grid-gpu-acceleration/

Thank you for the multigrid link. For the tridiagonal smoother, I need 5 arrays. With 2 ghost cells, the total memory for each subdomain is 5 x (256+4)^3 x 8 = 703,040,000 bytes. The Quadro GV100 has 32 GB. I will be able to put 48 subdomains on one GPU. With 2 GPUS, I should be able to put all 64 ranks (subdomains) onto the GPUS. The Quadro GP100 has 16 GB. I will be able to put 16 subdomains onto each GPU, so it will take 2 passes between the cpus and the gpus. Either way, it is a lot of data to send across the PCIe.

I would imagine the multigrid code, being your main bottleneck, probably takes longer than the time it takes to load all the data onto the GPUs (at ~10GB/s). But, as you mention in your first post, the finite difference solver (which is what my program is) would also benefit greatly from GPUs, so you could perhaps avoid the host-device transfers completely if your entire program can eventually be ported.

If you have the funds, I would advise getting the 2 V100’s, so that your problem fits entirely on the two GPUs. This would probably allow you to avoid thinking about asynchronously computing and loading domains - which itself could be a viable option if your multigrid solver takes much longer than the time it would take to load all the required data onto the GPU. (For my purposes I found that the GPU computations are so fast that the host-device transfer would completely dominate the runtime.)

I would still strongly recommend a 1-1 mapping of MPI ranks to GPUs. Such a code should scale well to many more GPUs, should you ever need to solve larger problems (though I have yet to be able to test my own in this regard).

Thanks again for the feedback. On my workstation, I will have to be careful with memory. I cannot make copies of arrays that are just used on the device. I have to share arrays on the host and the device or else I will run out of memory. Two Tesla K40 gpus are considerably cheaper than one Quadro GV100 gpu. For the purposes of testing, I could run a smaller job just to see whether there are performance gains.