CUDA runtime error: "all CUDA-capable devices are busy or unavailable"

I am using Ubuntu 16.04 with the nvidia-375 driver and three GPUs (Compute Mode = Default):

  • GTX Titan X
  • Quadro M6000
  • GTX 580

When I run the 0_Simple/simpleMultiGPU sample, I get “code=46(cudaErrorDevicesUnavailable)”, which translates to “all CUDA-capable devices are busy or unavailable”.

I get the same error in my own code:

  • sometimes only the Titan X and M6000 work, while the GTX 580 is busy
  • sometimes (after relaunching the application) only the GTX 580 works, while the Titan X and M6000 are busy

These launches work:

  • CUDA_VISIBLE_DEVICES=0 ./simpleMultiGPU
  • CUDA_VISIBLE_DEVICES=1 ./simpleMultiGPU
  • CUDA_VISIBLE_DEVICES=2 ./simpleMultiGPU
  • CUDA_VISIBLE_DEVICES=0,2 ./simpleMultiGPU

But these don’t work:

  • CUDA_VISIBLE_DEVICES=0,1 ./simpleMultiGPU
  • CUDA_VISIBLE_DEVICES=1,2 ./simpleMultiGPU
  • CUDA_VISIBLE_DEVICES=0,1,2 ./simpleMultiGPU
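
The launches above were run by hand; a minimal sketch of how the same sweep could be automated is below. The helper names (device_subsets, run_with_devices, sweep) are made up for illustration, and the script assumes the binary exits with a non-zero code when a CUDA call fails, which is how the samples' error-checking helper appears to behave.

```python
import itertools
import os
import subprocess


def device_subsets(num_devices):
    """Yield every non-empty combination of device indices: (0,), (1,), ..., (0, 1), ..."""
    for size in range(1, num_devices + 1):
        yield from itertools.combinations(range(num_devices), size)


def run_with_devices(command, devices, extra_env=None):
    """Run `command` with CUDA_VISIBLE_DEVICES limited to `devices`; return its exit code."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    # Optional extra variables, e.g. {"CUDA_DEVICE_ORDER": "PCI_BUS_ID"}.
    env.update(extra_env or {})
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def sweep(command, num_devices, extra_env=None):
    """Map each CUDA_VISIBLE_DEVICES string to True (exit code 0) or False (anything else)."""
    results = {}
    for devices in device_subsets(num_devices):
        key = ",".join(str(d) for d in devices)
        # Assumption: the program signals a CUDA failure through a non-zero exit code.
        results[key] = run_with_devices(command, devices, extra_env) == 0
    return results
```

Calling sweep(["./simpleMultiGPU"], 3) from the sample's directory would return a dictionary with one entry per combination. Passing extra_env={"CUDA_DEVICE_ORDER": "PCI_BUS_ID"} should make the CUDA indices follow PCI bus order, which is the order nvidia-smi lists the cards in; without it the CUDA runtime may number the devices differently.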

The GTX 580 corresponds to digit “1” in CUDA_VISIBLE_DEVICES. So is there some incompatibility between the GTX 580 and the other two GPUs? How can it be fixed?

Even more interesting, the OpenCL version of my application works well on all three GPUs.

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:05:00.0      On |                  N/A |
| 22%   40C    P8    16W / 250W |    329MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M6000        Off  | 0000:09:00.0     Off |                  Off |
| 26%   43C    P5    17W / 250W |      1MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 580     Off  | 0000:0B:00.0     N/A |                  N/A |
| 41%   40C    P0    N/A /  N/A |      0MiB /  3004MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

Many users have reported the same problem over the past seven years (on this forum - https://devtalk.nvidia.com/search/more/sitecommentsearch/all%20CUDA-capable%20devices%20are%20busy%20or%20unavailable/, on Stack Overflow and in other places). The problem is all the stranger because OpenCL works fine (despite running on top of the CUDA backend).

What is the reason for such behaviour?

I believe this bug is critical, and it makes some users of our application very sad, so I reported it with ID=1944892.

I will try to mirror all updates from the bug report in this topic.

Update: there are still no updates in the bug tracker :(