OpenACC code using -gpu=managed is faster than explicit data management for multi-GPU computation

Dear All,

I have been using explicit data management in my OpenACC code, and the code works across multiple GPUs. In my implementation, I can group GPUs into different MPI communicators, which facilitates coupling multiple components. GPUs in different communicators have to communicate through the host.

Out of curiosity, I compiled my code with -gpu=managed. For a single GPU, the code is slower than the build without -gpu=managed, which is what I expected.

But in my multi-GPU case (with the GPUs in different communicators), the code with -gpu=managed is actually faster (almost twice as fast). This is not what I expected, and I don't understand why.

Could someone offer a possible explanation? It could point me in the right direction for optimizing the code in the multi-GPU case.

Thanks
Feng

It's difficult for me to tell, but when I've seen this, it was because I was explicitly copying more data than needed. For example, in one code that printed out the interior of an array, my explicit data movement copied the full array, while with managed memory the driver only needed to migrate the printed elements.

Hi Mat,

Thanks for your reply. That might be the case; I think I need to do some multi-GPU optimization.

Thanks,
Feng
