CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a…
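For context, the pattern the post is about looks roughly like this: one CPU worker thread per GPU, with each thread calling cudaSetDevice() for its own device before doing any GPU work. This is only a minimal sketch I put together, not code from the article; dummyKernel and the buffer size are placeholders.

```cpp
// Minimal sketch (not the article's code): one CPU thread per GPU, and each
// thread sets its own current device before touching that GPU.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void dummyKernel(float *data, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void workerThread(int device) {
    // The current device is a per-thread setting, so every thread that issues
    // GPU work has to call cudaSetDevice itself.
    cudaSetDevice(device);

    const int n = 1 << 20;                 // arbitrary size for illustration
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<std::thread> threads;
    for (int d = 0; d < deviceCount; ++d)
        threads.emplace_back(workerThread, d);
    for (auto &t : threads) t.join();
    return 0;
}
```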

I hit this bug myself: my program always crashed in MPI while receiving a message. The most confusing part was that it did not crash when the receive buffer was small, which led me to suspect the stability of MPI itself.

Even more frustrating, about a week earlier I had hit another bug, an Open MPI issue that is more or less related to this one: https://github.com/open-mpi...
That issue was treated as an Open MPI bug and fixed in a later version.

After spending days tracking down the bug, I really wish I had seen this post earlier!

Is there any measurable performance impact from calling cudaSetDevice() unnecessarily? I would hope that if the current device is already 1, then calling cudaSetDevice(1) would just return right away without doing anything significant like talking to the GPU over the PCI bus. Is that how it actually works?
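One way to get a feel for the cost yourself is a quick timing loop like the sketch below. This is my own rough benchmark idea, not something from the post; the iteration count and timing method are arbitrary, and the numbers will vary by driver and system.

```cpp
// Rough sketch for timing redundant cudaSetDevice calls on a device that is
// already current. Methodology and iteration count are arbitrary choices.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    cudaSetDevice(0);   // make device 0 current
    cudaFree(0);        // force context creation so we don't time initialization

    const int iters = 1000000;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaSetDevice(0);   // device is already 0, so this should be cheap
    auto end = std::chrono::high_resolution_clock::now();

    double ns = std::chrono::duration<double, std::nano>(end - start).count() / iters;
    printf("avg cudaSetDevice(0) when already current: %.1f ns\n", ns);
    return 0;
}
```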

I’d also like to know about the overhead of unnecessarily calling cudaSetDevice().
My problem is that I don’t actually know where my calls end up being directed to device 0, since I only use the NPP and nvJPEG libraries.
For now I’ve added cudaSetDevice() calls all over the place, because I don’t know which ones are actually needed.
I tried this with two Quadro 2x00-series boards, which is where the symptom showed up.
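For illustration, here is a minimal sketch of the per-thread pattern I have been trying: NPP and nvJPEG calls run on the calling thread's current device (which defaults to device 0), so I set the device once at the start of each worker thread, before creating any library handles. decodeWorker and the device count are just placeholders.

```cpp
// Sketch only: pin each worker thread to one GPU by setting the current device
// at the top of the thread, before any NPP / nvJPEG work.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void decodeWorker(int device) {
    cudaSetDevice(device);   // without this, the thread's work defaults to device 0

    // ... create nvJPEG handles / NPP buffers and do the per-image work here;
    //     it all targets `device` as long as this thread doesn't change the
    //     current device again ...
}

int main() {
    std::vector<std::thread> workers;
    for (int d = 0; d < 2; ++d)          // e.g. the two Quadro boards
        workers.emplace_back(decodeWorker, d);
    for (auto &w : workers) w.join();
    return 0;
}
```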

I have never seen cudaSetDevice be a major performance limiter. As far as I know, the update is local to the calling thread and requires no synchronization.