CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

Originally published at:

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a…
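The pattern described above can be sketched roughly as follows. This is a minimal, hedged example, assuming one `std::thread` per GPU; error checking is abbreviated, and the work done per device is only indicated by comments.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void workerThread(int device) {
    // The current device is per-thread state: a newly spawned thread
    // starts on device 0, so each thread must set its own device
    // explicitly before making any other CUDA calls.
    cudaSetDevice(device);

    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 1 << 20);   // allocation lands on `device`
    // ... launch kernels, issue copies, etc., on this device ...
    cudaFree(d_buf);
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // One CPU thread per GPU, each pinned to its own device.
    std::vector<std::thread> threads;
    for (int d = 0; d < deviceCount; ++d)
        threads.emplace_back(workerThread, d);
    for (auto &t : threads)
        t.join();
    return 0;
}
```

Forgetting the `cudaSetDevice` call in the worker is exactly the kind of multithreading bug the post is about: every thread would silently operate on device 0.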

I was bitten by this bug myself: my program always crashed inside MPI while receiving a message.
The most confusing part was that it did not crash when the receive buffer was small, which led me to suspect the stability of MPI itself.

Even more embarrassing: about a week earlier I had hit another bug, an open-mpi issue more or less related to this one.
That issue was treated as a bug in open-mpi and fixed in a later version.

After spending days tracking down the bug, I really wish I had seen this post earlier!

Is there any measurable performance impact from calling cudaSetDevice unnecessarily? I would hope that if the current device is already 1, then calling cudaSetDevice(1) would just return right away without doing anything significant like talking to the GPU over the PCI bus. Is that how it actually works?