I am experiencing a problem similar to the ones described here:
Our application processes a large number of small jobs (<1 s of GPU time per job), distributed across a cluster. Each GPU is shared by multiple workers, which acquire exclusive access to it for each job by taking a lock. Each worker runs in its own thread and uses its own CUDA context. A job consists of copying data to the GPU, running multiple kernels, and copying the results back. CUDA events are used to synchronize the kernel launches within a stream.
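For context, this is roughly what a single job looks like per worker. The names (`Worker`, `process_job`, etc.) and the lock handling are simplified placeholders, not our actual code, and error checking and setup are omitted:

```c
#include <cuda.h>
#include <pthread.h>
#include <stddef.h>

/* Per-worker state, created once at startup (module/kernel setup omitted). */
typedef struct {
    CUcontext        ctx;      /* each worker thread has its own context       */
    CUstream         stream;
    CUfunction       kernel;   /* stand-in for the kernels in the pipeline     */
    CUevent          done;
    CUdeviceptr      d_buf;
    pthread_mutex_t *gpu_lock; /* per-GPU lock, shared by all workers on it    */
} Worker;

static void process_job(Worker *w, const void *host_in, void *host_out, size_t bytes)
{
    pthread_mutex_lock(w->gpu_lock);       /* exclusive access to the GPU per job */
    cuCtxSetCurrent(w->ctx);

    cuMemcpyHtoDAsync(w->d_buf, host_in, bytes, w->stream);

    void *params[] = { &w->d_buf };
    cuLaunchKernel(w->kernel, 256, 1, 1, 128, 1, 1, 0, w->stream, params, NULL);
    /* ... further launches in the same stream, ordered via events ... */

    cuEventRecord(w->done, w->stream);
    cuEventSynchronize(w->done);           /* <-- occasionally never returns */

    cuMemcpyDtoHAsync(host_out, w->d_buf, bytes, w->stream);
    cuStreamSynchronize(w->stream);

    pthread_mutex_unlock(w->gpu_lock);
}
```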
For a very small fraction of these jobs (roughly one in 10,000 to 100,000), the cuEventSynchronize() call used to wait for the last kernel in the pipeline never returns and hangs indefinitely. While no other worker is actively launching kernels or copying data at that moment, other workers may be making calls such as cuCtxCreate(). As soon as one thread hangs, every other thread that calls into the driver API hangs as well. If I add a cuCtxSynchronize() call after every launch, the worker gets stuck there instead, usually after one specific kernel, which happens to have the longest runtime of all the kernels in the pipeline.
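The debug variant just adds a blocking call after each launch (simplified, same placeholder names as in the sketch above):

```c
/* Debug variant: synchronize after every launch to pin down where it hangs. */
cuLaunchKernel(w->kernel, 256, 1, 1, 128, 1, 1, 0, w->stream, params, NULL);
CUresult rc = cuCtxSynchronize();   /* blocks forever after the longest-running kernel */
/* rc is never even set to an error code -- the call simply does not return   */
```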
This kernel does not contain any data-dependent loops or thread synchronization that could cause a deadlock. It does use warp shuffle instructions, but we already tried replacing them with (synchronized) shared memory accesses, without success.
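This is not our actual kernel code, but it shows the kind of substitution we tried (warp shuffle replaced by a shared-memory reduction with explicit synchronization):

```cuda
// Original style: warp-level sum via shuffles (sm_35, CUDA 5.5/7.0 intrinsics)
__device__ float warp_sum_shfl(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);        // after the loop, lane 0 holds the sum
    return v;
}

// Replacement we tested: same reduction through shared memory with __syncthreads().
// 's' points at this warp's 32-float slice of shared memory, 'lane' is 0..31;
// all threads of the block call this function, so the barriers are safe.
__device__ float warp_sum_shared(float v, volatile float *s, int lane)
{
    s[lane] = v;
    __syncthreads();
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            s[lane] += s[lane + offset];
        __syncthreads();
    }
    return s[0];                            // every lane sees the full sum
}
```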
Since this is a Linux setup without X, there is no launch timeout, so workers stay stuck indefinitely without any error. nvidia-smi reports 0% utilization in this state, suggesting that no kernel is actually running anymore.
When such a process is killed, all of its jobs are redistributed to other machines, where they are processed successfully.
The problem does not occur if I reduce the number of worker threads to one per GPU (still two worker threads in total, as each machine has two GPUs). To me, this, together with the fact that threads using other contexts are affected once one worker hangs, suggests that something goes wrong when certain API calls happen concurrently.
I am not aware of any restriction on two threads using different contexts concurrently. Can somebody make a definitive statement about that? Are there any driver API calls that must not be made concurrently, even from distinct contexts?
The problem has occurred with every combination of driver versions 340.32 and 346.59 and CUDA toolkit versions 5.5.22 and 7.0.28, on our Tesla K20Xm GPUs. cuda_memtest did not find any errors, unlike in the two links posted above.
Could this be a driver or hardware issue? Is there any way to detect and recover from the problem when it happens? Since all driver API calls get stuck and there is no configurable timeout, we currently have to kill the process manually, which is not feasible in production.
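For illustration, the kind of per-job timeout we would need looks roughly like this (purely hypothetical sketch, reusing the names from the first snippet; I don't know whether cuEventQuery() itself would stay responsive once the hang has started):

```c
#include <time.h>

/* Hypothetical workaround: poll the event with a deadline instead of blocking
   in cuEventSynchronize(), so the worker can at least flag the job as stuck. */
static int wait_with_timeout(CUevent ev, double timeout_s)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    double deadline = ts.tv_sec + ts.tv_nsec * 1e-9 + timeout_s;

    for (;;) {
        CUresult rc = cuEventQuery(ev);       /* CUDA_ERROR_NOT_READY while pending */
        if (rc != CUDA_ERROR_NOT_READY)
            return (rc == CUDA_SUCCESS) ? 0 : -1;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        if (ts.tv_sec + ts.tv_nsec * 1e-9 > deadline)
            return -2;                        /* give up; mark worker/job as hung */

        struct timespec pause = { 0, 1000000 };  /* back off 1 ms between polls */
        nanosleep(&pause, NULL);
    }
}
```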