Running multiple processes on one GPU causes it to get stuck

I am running on Linux with CUDA 2.3 on a Tesla C1060 card. I have two different kernels (say K1, K2), each running in a separate Linux process (P1, P2). The CPU/GPU interface is straightforward: cudaSetDevice, memcpy from CPU to GPU, kernel launch, cudaThreadSynchronize, memcpy from GPU to CPU…
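A minimal sketch of that host-side flow, assuming a hypothetical kernel name K1 and placeholder sizes (not the actual application code; cudaThreadSynchronize is the pre-CUDA-4.0 API, matching the CUDA 2.3 setup described):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder kernel standing in for K1; the real computation is not shown
// in the thread.
__global__ void K1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaSetDevice(0);                                    // select the C1060
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    K1<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();                             // wait for the kernel
    if (cudaGetLastError() != cudaSuccess)               // where "unspecified launch
        fprintf(stderr, "kernel launch failed\n");       // failure" would surface
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    free(h);
    return 0;
}
```

Each process (P1, P2) runs this sequence independently in a loop, so the driver multiplexes their contexts onto the single device.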

On a single GPU, I can run P1 or P2 or P1+P2 in tight loops for days/weeks and everything is fine. However, as soon as I add another instance of P2 into the picture (P1+P2+P2), either with all three running in tight loops or with the second P2 as a periodic probe, things begin to fall apart. I get sporadic "unspecified launch failure" errors (mostly from P2/K2), and eventually one of the processes gets stuck in either "R" or "D" state and no further work can be processed. At that point all processes on the GPU are stuck, and the only way to recover the GPU is a reboot. EIPs from the stuck processes suggest they are blocked in either ioctl() or cudbgIpcCall(). CUDA 2.2 had the same problem.

What could be the possible cause?

tmurray has mentioned in previous threads that multiple processes sharing one CUDA device can potentially hit race conditions or deadlocks due to driver bugs. This is supposedly improved in later CUDA releases, though. Can you test the CUDA 3.0 beta?

Even beyond bugs, there’s no performance reason to do what you’re trying to do. Context switching between different GPU contexts is expensive.

Understood. However, the setup is dictated by factors other than performance considerations.

My driver is /usr/lib64/ So this version does have some bugs in this scenario? Was I just lucky when I had two processes (P1+P2) running without any problems? Does this problem have anything to do with how much GPU resources (global memory, shared memory…) these kernels use? Thanks!

190.18 has plenty of known issues at this point; please try the latest 195/196.xx Linux driver (I can't keep track of the exact version).

How is context switching between different GPU contexts handled if I launch two processes, each computing some function on the GPU?

When is the context switch carried out, and how expensive is this operation?

Could you explain what happens under the following different conditions?

(1) Kernel context switching between different GPU contexts

(2) Kernel context switching within one GPU context