I have an application that uses the GPU (a Tesla 1060). The application only calls "simple" CUDA functions, such as cudaMalloc, cudaMemcpy, or kernel launches.
If I "connect" 4 MPI tasks to one GPU (i.e. I call cudaSetDevice(0) in each MPI task), my application runs without problems.
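For what it's worth, here is a minimal sketch of that setup (one rank per process, every rank on device 0), with every CUDA return code checked, since "unknown error" is often a leftover error from an earlier call that was never inspected. The `CUDA_CHECK` macro is my own naming, not from the original application:

```cuda
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Abort with the rank and the CUDA error string instead of silently
   continuing; this usually pinpoints which call actually failed. */
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            int rank;                                                     \
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);                         \
            fprintf(stderr, "rank %d: %s failed: %s\n",                   \
                    rank, #call, cudaGetErrorString(err));                \
            MPI_Abort(MPI_COMM_WORLD, 1);                                 \
        }                                                                 \
    } while (0)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Every rank binds to the same device, as in the original setup. */
    CUDA_CHECK(cudaSetDevice(0));

    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, 1024 * sizeof(float)));

    /* ... kernel launches, cudaMemcpy, etc. ... */

    CUDA_CHECK(cudaFree(d_buf));
    MPI_Finalize();
    return 0;
}
```

Running this under `mpirun -np 6` should at least report which rank and which call produces the error, instead of a bare "unknown error" later on.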
But if I try with more than 4 tasks, for example 6, each MPI task gets an error ("unknown error") from cudaMalloc.
The amount of memory used is very small, so it is not a problem of GPU memory saturation.
Is the number of Unix processes that can access one GPU limited?
I have had 8 processes all using one GPU, so I don't think 4 is the limit.
I haven't really done much with MPI, so I don't know the specifics, but perhaps you could make some sort of singleton worker class that handles the GPU functions. That way the object only gets created once, no matter how many threads are using it. Then you could have a 'task' queue that each process can put a 'task' into (i.e. a kernel call of some kind, plus any associated memory writes/reads), and the worker class would execute them in order.
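The suggestion above talks about processes, but a cross-process version would need shared memory or sockets; here is a hedged sketch of the same pattern with threads instead, where a single worker thread owns the GPU and everything else goes through a queue. The class name `GpuWorker` and the overall shape are my own invention, just to illustrate the idea:

```cuda
#include <functional>
#include <mutex>
#include <queue>
#include <condition_variable>
#include <thread>
#include <cuda_runtime.h>

// One worker thread owns device 0; all other threads submit callables
// (a kernel launch plus its memory transfers) instead of touching CUDA.
class GpuWorker {
public:
    GpuWorker() : done_(false), thread_(&GpuWorker::run, this) {}
    ~GpuWorker() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        thread_.join();
    }
    // Any thread can enqueue a task; tasks run one at a time, in order.
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        cudaSetDevice(0);   // only this thread ever talks to the GPU
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();         // e.g. cudaMalloc + kernel<<<...>>> + cudaMemcpy
        }
    }
    bool done_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    std::thread thread_;    // declared last so it starts after the state above
};
```

With this shape only one CUDA context exists, regardless of how many client threads call `submit`.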
OK, and each of your processes performs cudaMalloc? Do you use some kind of synchronisation (a semaphore, ...) when you access the GPU?
These are not MPI processes. I am just running 8 copies of the same program. The processes don’t talk to each other, so there is no semaphore or any other synchronization.