I am planning to build a more powerful GPU server. I am debating between a multi-GPU system (with cheaper GPUs) and a single powerful GPU (more expensive).
Before I venture out on that, I am wondering about the following:
whether Unified Memory (UM), when using multiple devices with compute capability > 6.x, treats the multiple GPUs as a single GPU.
whether GPU-GPU data communication is automatically handled by the compiler, so that I don’t have to make any changes to my single-GPU OpenACC code.
Would appreciate your input on the above. As always, if there is any literature on this, please feel free to point me to it.
Multiple GPUs are treated separately, so you need to use MPI with each rank assigned to a particular GPU. There are other methods to support multi-GPU, but I find MPI the easiest, and it also allows you to scale across multiple systems in the future.
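A minimal sketch of that device assignment in C, assuming the OpenACC runtime API and a node-local split of MPI_COMM_WORLD (the names and the modulo mapping are illustrative, not taken from the linked post):

```c
/* Sketch: bind one GPU per MPI rank using the node-local rank. */
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Determine the rank local to this node so each rank picks a different GPU. */
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &local_comm);
    int local_rank;
    MPI_Comm_rank(local_comm, &local_rank);

    /* Assign this rank to one of the node's GPUs; the modulo handles
     * the case of more ranks per node than GPUs. */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(local_rank % ngpus, acc_device_nvidia);

    /* ... existing single-GPU OpenACC code runs here, one GPU per rank ... */

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}
```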
CUDA-aware MPI, which does GPU direct communication, is enabled by default with the MPI versions we ship with the compilers. However, you need to pass the device pointers to the MPI calls by using an OpenACC "host_data" region. Passing UM pointers will work, but MPI won't recognize these as device pointers so won't use GPU direct. Hence, if using MPI, I recommend you manually manage your data via data regions.
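For example, a sketch of a halo exchange where "host_data" exposes the device pointer to a CUDA-aware MPI call (the routine name, buffer, and use of MPI_Sendrecv_replace are illustrative, and the buffer is assumed to already be in an enclosing data region):

```c
#include <mpi.h>

void exchange_halo(double *buf, int n, int neighbor, MPI_Comm comm)
{
    /* "buf" is assumed to be present on the device, e.g. via an enclosing
     * "#pragma acc data copy(buf[0:n])" region set up by the caller. */
    #pragma acc host_data use_device(buf)
    {
        /* Inside host_data, "buf" resolves to the device address, so a
         * CUDA-aware MPI can move it GPU-to-GPU directly. */
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE,
                             neighbor, 0, neighbor, 0,
                             comm, MPI_STATUS_IGNORE);
    }
}
```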
See the following post for the code I use for device assignment, as well as links to some training: