Hi everyone, I am considering converting a large MPI physics simulation to a hybrid CPU/GPU code using CUDA and/or OpenACC.
The code will be run on many nodes, each of which may contain, for example, 8 CPU cores and 1 GPU.
The code is currently written so that each CPU core runs one MPI process. However, this raises the problem of those 8 MPI processes contending for the single GPU. How is this typically handled? Alternatively, can the GPU accelerate the code so much that running only 1 MPI process (and 1 GPU) per node would still be faster overall?
Titan, the new supercomputer at Oak Ridge, for example, has 16 CPU cores per GPU, so clearly there must be established solutions to this problem.