I am developing a code that uses the MPI library and CUBLAS. It runs on Fermi nodes (2 nodes, 2 GPUs per node, and two 6-core AMD CPUs per node).
If I run 2 MPI processes per node and each MPI process has its own GPU, the results are excellent. But if I run 4 MPI
processes per node with the following binding:
MPI task 0 bound to core id 0
MPI task 1 bound to core id 3
MPI task 2 bound to core id 6
MPI task 3 bound to core id 9
then MPI processes 0 and 1 share the first GPU, while MPI processes 2 and 3 are assigned to the second GPU.
When an MPI process calls CUBLAS, the calls appear to be serialized, as if there were contention for the device. For example, MPI process 1 waits for the completion of the CUBLAS call
from MPI process 0, so I lose performance.
My question is: can two (or four) MPI processes access the card simultaneously (or with minimal latency), each executing a kernel on 1/2, 1/4, ... of the streaming multiprocessors inside the card,
instead of one MPI process having to wait for the completion of another process's kernel before starting the same kernel itself?