I’ve been given a complex piece of CUDA code to investigate. It is designed to run in parallel on multiple GPUs, but apparently it does not: NVVP shows the GPUs executing serially. I have made a few amendments to the code to enable asynchronous data transfer with concurrent kernel execution. When I profile the code with NVVP, the parts I amended show the GPUs running in parallel, but in the rest of the code, which I have not amended, the GPUs still execute serially.
How can this be?
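If your code is using cudaMemcpy or cudaDeviceSynchronize, those calls will block the host thread. Therefore a sequence that launches kernel0 on device 0, calls cudaMemcpy (or cudaDeviceSynchronize), and only then switches to device 1 to launch kernel1 will not allow kernel0 and kernel1 to execute concurrently, even though they are launched onto 2 separate devices. The host thread does not reach the second launch until the blocking call has returned.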
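To make the blocking pattern concrete, here is a minimal sketch of that kind of sequence. It is not taken from the code in question: the kernel bodies, buffer names, and sizes are hypothetical placeholders, it assumes a machine with at least two CUDA devices, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real per-device workloads.
__global__ void kernel0(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 2.0f;
}

int main() {
    const int n = 1 << 20;                 // illustrative problem size
    const size_t bytes = n * sizeof(float);

    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) {
        printf("This sketch needs at least 2 CUDA devices.\n");
        return 0;
    }

    // Ordinary pageable host buffers.
    float *h0 = (float *)malloc(bytes);
    float *h1 = (float *)malloc(bytes);
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    cudaMemset(d0, 0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    cudaMemset(d1, 0, bytes);

    // Device 0: the launch itself is asynchronous, but the cudaMemcpy that
    // follows blocks the host until kernel0 and the copy have both finished.
    cudaSetDevice(0);
    kernel0<<<(n + 255) / 256, 256>>>(d0, n);
    cudaMemcpy(h0, d0, bytes, cudaMemcpyDeviceToHost);

    // Device 1: the host only gets here after device 0 is completely done,
    // so kernel1 cannot overlap with kernel0.
    cudaSetDevice(1);
    kernel1<<<(n + 255) / 256, 256>>>(d1, n);
    cudaMemcpy(h1, d1, bytes, cudaMemcpyDeviceToHost);

    cudaSetDevice(0);
    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    free(h0);
    free(h1);
    return 0;
}
```

In a profiler timeline, a sequence like this would be expected to show device 1 idle until the device 0 copy completes.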
If refactoring code of that type to use cudaMemcpyAsync, for example, is what you mean by “amended”, then it seems you are already aware of this concept, and your question is puzzling. Wherever that sort of blocking call remains unfixed, the code will not run concurrently on separate devices.
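For comparison, one common way to refactor such a section is to issue the work for every device first and only then wait. The sketch below is an assumption about what that could look like, not the actual amended code: the kernel `work`, the array names, and the fixed device count of 2 are hypothetical, and error checking is omitted. It uses pinned host memory from cudaMallocHost, because cudaMemcpyAsync to pageable memory may not be fully asynchronous with respect to the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real per-device workload.
__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                 // illustrative problem size
    const size_t bytes = n * sizeof(float);
    const int ndev = 2;                    // assumption: two devices are used

    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < ndev) {
        printf("This sketch needs at least %d CUDA devices.\n", ndev);
        return 0;
    }

    float *h[ndev];
    float *d[ndev];
    cudaStream_t s[ndev];

    // Per-device setup: one stream, one pinned host buffer, one device buffer.
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        cudaStreamCreate(&s[dev]);
        cudaMallocHost(&h[dev], bytes);    // pinned host memory
        cudaMalloc(&d[dev], bytes);
        cudaMemsetAsync(d[dev], 0, bytes, s[dev]);
    }

    // Issue phase: nothing here waits for the GPU, so the host moves on to
    // the next device immediately and the devices can work at the same time.
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        work<<<(n + 255) / 256, 256, 0, s[dev]>>>(d[dev], n);
        cudaMemcpyAsync(h[dev], d[dev], bytes, cudaMemcpyDeviceToHost, s[dev]);
    }

    // Wait phase: block only after all devices have been given their work.
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(s[dev]);
    }

    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        cudaFree(d[dev]);
        cudaFreeHost(h[dev]);
        cudaStreamDestroy(s[dev]);
    }
    return 0;
}
```

The key design choice is the separation of the issue loop from the wait loop. Any cudaMemcpy or cudaDeviceSynchronize left inside the issue loop would bring back the serial behavior.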