Dear All,
I have been using explicit data management in my OpenACC code, and the code works on multiple GPUs. In my implementation, I can group GPUs into different MPI communicators, which makes it easier to couple multiple components. For GPUs in different communicators, the communication has to go through the host.
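For context, here is a minimal sketch of the setup I described above (not my actual code; the split rule, array name, and sizes are placeholders for illustration only):

```c
#include <stdlib.h>
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* One MPI rank per GPU: bind each rank to a device on its node. */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(world_rank % ngpus, acc_device_nvidia);

    /* Group ranks into component communicators (e.g. first half = component 0). */
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm comp_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comp_comm);

    const int N = 1 << 20;
    double *a = (double *)malloc(N * sizeof(double));

    /* Explicit data management: device copies created and updated by hand. */
    #pragma acc data create(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = (double)i;

        /* Exchange between components goes through the host:
           update the host copy, communicate with MPI, update the device copy. */
        #pragma acc update self(a[0:N])
        /* ... MPI_Sendrecv(a, ...) with a rank in the other communicator ... */
        #pragma acc update device(a[0:N])
    }

    free(a);
    MPI_Comm_free(&comp_comm);
    MPI_Finalize();
    return 0;
}
```

The managed-memory build I mention below is identical except that -gpu=managed is added to the compile line.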
Out of curiosity, I also compiled the code with -gpu=managed. For a single GPU, this version is slower than the one without -gpu=managed, which is what I expected.
But for my multi-GPU case (with the GPUs in different communicators), the code with -gpu=managed is actually faster (almost twice as fast). This is not what I expected, and I don’t understand why.
Could someone please offer a possible explanation? It could point me in the right direction for optimizing the code for the multi-GPU case.
Thanks
Feng