MPI and CUDA Fortran

Hello,

Is it possible to distribute the workload over 64 MPI ranks and 4 Tesla GPUs? I have an application, written in Fortran with MPI, that runs well on 20 to 30 thousand CPUs on a supercomputer. The application uses domain decomposition to solve an incompressible flow in order to study wave breaking with a two-phase formulation (water and air). A key part of the code uses multigrid to solve a variable-coefficient Poisson equation, and the relaxation phase uses tridiagonal sweeps. The finest grid levels are typically 128^3 to 256^3 grid points; the coarsest grid levels are typically 8^3 grid points or smaller.

I want to distribute the 64 finest subdomains over the GPUs, using MPI to manage memory and communication. If the coarsest levels are not solved efficiently on the GPUs, I could use the CPUs for them. Would this work?

I could use a pencil decomposition that spans the entire domain, but then the tridiagonal sweeps would involve expensive all-to-all communication, which I want to avoid. Hence my focus on smaller subdomains of 128^3 to 256^3 grid points.

Thank you,
Doug

Hi DougD,

Would this work?

Assuming that the computation over the fine grid is parallelizable, it should work.

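The problem size is large enough that a single subdomain should be able to utilize the full GPU, which is a bit problematic since you’ll have 8 ranks sharing the same GPU. Depending upon what each rank is doing, some may need to wait to use the GPU. Running a Multi-Process Service (MPS) server can help: without it, work from different processes is time-sliced on the GPU, whereas MPS lets kernels from multiple ranks run concurrently when resources allow.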
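For the rank-to-GPU mapping itself, here is a minimal CUDA Fortran sketch of one common approach: derive a node-local rank and assign devices round-robin. It assumes all ranks that share the GPUs run on the same node and an MPI-3 library that provides `MPI_Comm_split_type`; the program and variable names (`assign_gpu`, `local_rank`, `my_device`) are just illustrative, not taken from your code.

```fortran
program assign_gpu
  use mpi
  use cudafor
  implicit none
  integer :: ierr, istat
  integer :: world_rank, local_comm, local_rank
  integer :: num_devices, my_device

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! Group the ranks that share a node, so each one gets a node-local rank.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! Count the GPUs visible to this process and pick one round-robin.
  istat = cudaGetDeviceCount(num_devices)
  if (istat /= cudaSuccess .or. num_devices < 1) then
     print *, 'rank ', world_rank, ': no usable GPU found'
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if
  my_device = mod(local_rank, num_devices)
  istat = cudaSetDevice(my_device)

  print *, 'world rank ', world_rank, ' (local rank ', local_rank, &
           ') uses GPU ', my_device

  ! ... allocate device arrays and run the fine-grid multigrid levels here ...

  call MPI_Comm_free(local_comm, ierr)
  call MPI_Finalize(ierr)
end program assign_gpu
```

With the NVIDIA HPC SDK this would typically be built through the MPI compiler wrapper with CUDA Fortran enabled (for example `mpif90 -cuda assign_gpu.f90`, assuming the wrapper invokes nvfortran). Round-robin assignment spreads the ranks evenly over the GPUs, which is the situation where MPS is most useful.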