I am developing a code using the MPI library and cuBLAS. My code runs on Fermi nodes (2 nodes, 2 GPUs per node, and two 6-core AMD CPUs per node).
If I run my code with 2 MPI processes per node and each MPI process has its own GPU, the results are excellent. But if I have 4 MPI
processes per node with the following binding:
[list=1]
[*] MPI task 0 bound to core id 0
[*] MPI task 1 bound to core id 3
[*] MPI task 2 bound to core id 6
[*] MPI task 3 bound to core id 9
[/list]
then MPI processes 0 and 1 are attached to the first GPU, whereas MPI processes 2 and 3 are assigned to the second GPU.
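For reference, a minimal sketch of how each rank could pick its GPU (this assumes 4 tasks per node packed by rank and 2 GPUs per node; the names and the mapping rule are only illustrative, adjust them to your launcher):

[code]
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 4 MPI tasks per node, 2 GPUs per node:
       local ranks 0,1 -> GPU 0 and local ranks 2,3 -> GPU 1
       (assumes ranks are packed per node; adjust to your launcher) */
    int tasks_per_node = 4;
    int gpus_per_node  = 2;
    int local_rank = rank % tasks_per_node;
    int device = local_rank / (tasks_per_node / gpus_per_node);

    cudaSetDevice(device);
    printf("rank %d uses GPU %d\n", rank, device);

    /* ... create the cuBLAS handle and do the work here ... */

    MPI_Finalize();
    return 0;
}
[/code]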
When I call cuBLAS from an MPI process, the calls seem to be serialized. For example, MPI process 1 waits for the completion of the cuBLAS call
from MPI process 0, so I lose performance.
My question is: can 2 or 4 MPI processes access the card simultaneously (or with minimal latency), each executing a kernel on 1/2, 1/4, … of the streaming multiprocessors inside the card,
without waiting for the completion of each kernel of one MPI process before starting the same kernel from another MPI process?
I think kernel overlapping is automatic, with no direct user control. If the cuBLAS kernel from process 0 launches a multiple of the maximum number of blocks that can be active on the SMs, the amount of overlap is negligible (if there is any overlap at all). If the block count is not a multiple of that maximum, some overlap will occur.

I have tried multi-GPU on a single-node workstation where the GPUs share the PCIe bus, and it performed worse than single-GPU (I used multi-GPU to increase data parallelism). Going the other way, with two processes competing for the same GPU, you get kernel scheduling overhead, concurrent processes draining all the computing power, and the OS racing for the PCIe bus. In my experience, even user-enforced mutual exclusion adds high overhead due to PCIe bus contention.
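To make the "overlap" point concrete: within a single process (single context) on a Fermi-class card you can at least request concurrent kernels by putting independent cuBLAS calls on different streams with cublasSetStream; whether they really overlap then depends on how many blocks each kernel occupies, as described above. As far as I know this does not help across separate MPI processes, since each process has its own context. A rough sketch (the device arrays and sizes are placeholders and assumed to be set up elsewhere):

[code]
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Launch two independent SGEMMs on separate streams so the hardware
   is free to overlap them if each one leaves SMs idle.
   d_A..d_F are assumed to be n*n device arrays already filled with data. */
void two_gemms_on_streams(cublasHandle_t handle, int n,
                          const float *d_A, const float *d_B, float *d_C,
                          const float *d_D, const float *d_E, float *d_F)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const float alpha = 1.0f, beta = 0.0f;

    cublasSetStream(handle, s0);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);

    cublasSetStream(handle, s1);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_D, n, d_E, n, &beta, d_F, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
[/code]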
I'd risk saying that one process per GPU is fine (though not always fine when the GPUs share a bus, depending on the problem and its dimensions), but two processes per GPU is a nuisance. Don't take this as a general rule, but that is what my multi-GPU programming experience so far suggests.
Still, you can work around that. If possible, it is better to increase data parallelism by using one thread per GPU than by increasing the number of threads per GPU with a fixed data size. I don't know the nature of your problem, but why do you spawn multiple CPU threads per GPU?
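If your data can be split, a minimal sketch of the one-host-thread-per-GPU pattern looks like this (here with OpenMP; the per-GPU arrays and sizes are placeholders and assumed to already live on the corresponding device):

[code]
#include <omp.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* One host thread per GPU, each working on its own slice of the data.
   d_A[dev], d_B[dev], d_C[dev] are assumed to be n*n arrays already
   resident on device 'dev'. */
void sgemm_per_gpu(int num_gpus, int n,
                   float **d_A, float **d_B, float **d_C)
{
    const float alpha = 1.0f, beta = 0.0f;

    #pragma omp parallel num_threads(num_gpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                 /* bind this thread to its GPU */

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* each thread/GPU computes its own, independent GEMM */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, d_A[dev], n, d_B[dev], n,
                    &beta,  d_C[dev], n);

        cudaDeviceSynchronize();
        cublasDestroy(handle);
    }
}
[/code]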