I have an application that uses the GPU (Tesla 1060). This application only calls "simple" CUDA functions, such as cudaMalloc, cudaMemcpy, or kernel launches.
If I "connect" 4 MPI tasks to one GPU (i.e., I call cudaSetDevice(0) in each MPI task), my application runs without problems.
But if I try with more than 4 tasks, for example 6, each MPI task gets an error ("unknown error") from cudaMalloc.
The memory used is very small, so it is not a problem of GPU memory saturation.
Is the number of Unix processes that can access one GPU limited?
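One thing worth ruling out: "unknown error" is often a sticky error left over from an earlier CUDA call that failed silently, so the cudaMalloc that reports it may not be the real culprit. A minimal sketch of checking every runtime call, assuming the standard CUDA runtime API (the CUDA_CHECK macro name is my own, not from the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report the first CUDA call that fails, rather than a later call
// that merely inherits the error state.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s failed: %s\n",             \
                    __FILE__, __LINE__, #call,                    \
                    cudaGetErrorString(err));                     \
            return 1;                                             \
        }                                                         \
    } while (0)

int main() {
    float *d_buf = NULL;
    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

Running each of the 6 tasks with checks like this would show whether cudaSetDevice (or something else before the allocation) is the call that actually fails.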
I haven't really done much with MPI, so I don't know the specifics, but perhaps you could make some sort of singleton worker class that handles the GPU functions. That way the object gets created only once, no matter how many threads use it. Then you could have a 'task' queue that each thread puts a 'task' into (i.e., a kernel call of some kind, plus any associated memory writes/reads), and the worker class would execute them in order.
These are not MPI processes. I am just running 8 copies of the same program. The processes don’t talk to each other, so there is no semaphore or any other synchronization.