System with 4 GPUs, 2 of them in SLI, crashes with cudaErrorDevicesUnavailable

I have a Linux system with 4 GTX 1070s, all the exact same manufacturer/model. Everything works perfectly when SLI is completely disabled. When I enable SLI I get the following nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:02:00.0  On |                  N/A |
|  0%   46C    P8     9W / 180W |     47MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   47C    P8     9W / 180W |     47MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:87:00.0 Off |                  N/A |
|  0%   37C    P8    13W / 180W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1070    Off  | 00000000:88:00.0 Off |                  N/A |
|  0%   46C    P8     9W / 180W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2342      G   /usr/lib/xorg/Xorg                            14MiB |
|    1      2342      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

Only when the X server with SLI enabled is running do the first two GPUs report a total memory of 8191MiB; otherwise all four GPUs report 8119MiB. That doesn't seem to be the problem, though (or at least I don't think so).

If I enable SLI on the first two GPUs (indices 0,1), I can still use CUDA, but only on either the SLI pair or the non-SLI pair, which I make visible to the CUDA application by setting the environment variable CUDA_VISIBLE_DEVICES="0,1" or CUDA_VISIBLE_DEVICES="2,3". Even if I make all four GPUs visible to the CUDA application (confirmed via cudaGetDeviceCount()) but only use one of the pairs, everything works fine.
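
For reference, this is roughly how I confirm which physical GPUs a given CUDA_VISIBLE_DEVICES setting exposes, by matching the PCI bus IDs against the nvidia-smi table above (a minimal sketch, not my actual code):

// Sketch: enumerate the devices CUDA sees and print their PCI bus IDs,
// so I can match them against the nvidia-smi output above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("visible device %d: %s, PCI %04x:%02x:%02x.0\n",
               dev, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}

Running this with CUDA_VISIBLE_DEVICES="2,3" shows only the second pair, as expected.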

Now, when I make all four GPUs visible and actually use both pairs, the application always fails with cudaErrorDevicesUnavailable as soon as it tries to do anything on the second pair of GPUs. For example, creating one stream per device in a loop fails when the stream creation reaches the first GPU of the second pair, no matter which pair comes first in CUDA_VISIBLE_DEVICES. I first noticed this in my own application, but I can reproduce the same issue with the sample applications 0_Simple/simpleMultiGPU and 4_Finance/MonteCarloMultiGPU from the CUDA toolkit.
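
To be concrete, the failing pattern boils down to something like this (a minimal sketch with error handling trimmed, not my actual code): with all four GPUs visible, the cudaStreamCreate call returns cudaErrorDevicesUnavailable once the loop reaches the first device of the second pair.

// Sketch: one stream per visible device; fails on the first device
// of the second pair when both pairs are visible to the same process.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);              // reports 4 when all GPUs are visible
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaStream_t stream;
        cudaError_t err = cudaStreamCreate(&stream);
        printf("device %d: %s\n", dev, cudaGetErrorString(err));
        if (err != cudaSuccess) return 1;    // cudaErrorDevicesUnavailable here on device 2
        cudaStreamDestroy(stream);
    }
    return 0;
}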

When I run two separate processes, each addressing one pair, everything works as well, so I can narrow the problem down to a single process trying to access both pairs. My application is quite complex and uses both CUDA code for computation and CUDA-GL interoperability for rendering, so decoupling the compute and rendering parts into separate processes is not really an option.

I can't find any documentation on whether this is expected behavior or not. Could someone point out whether I have missed something?

Thanks,
Peter