vkCreateDevice fails with VK_ERROR_INITIALIZATION_FAILED on CentOS 7 Cluster

Hello everyone,

We’re trying to get Vulkan (and the Vulkan SDK) running on the GPU cluster that powers our multi-projector CAVE, but we can’t get it to work even on a single node of the cluster.
The main issue is that every call to vkCreateDevice, no matter which application makes it, fails with VK_ERROR_INITIALIZATION_FAILED. We went through all of the steps below on two clusters with different hardware:

Hardware

  • Cluster 1: 2x Quadro P6000
  • Cluster 2: 1x GTX 780 Ti
  • Shared cluster filesystem, but the SDK was explicitly tested on the local filesystem of a single node with a regular monitor attached to one GPU.

Software

  • CentOS 7.8
  • Packages:
    • vulkan.x86_64 (1.1.97.0-1.el7)
    • vulkan-devel.x86_64 (1.1.97.0-1.el7)
    • vulkan-filesystem.noarch (1.1.97.0-1.el7)
    • gcc 4.8.5 (CentOS default, unloaded)
    • gcc 7.3.0 (loaded via module load gcc/7)
    • Nvidia Unix Driver 450.80.02 [tested with various other versions as well]
  • nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:3B:00.0 Off |                    0 |
| 26%   18C    P8     9W / 250W |     65MiB / 22916MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P6000        Off  | 00000000:86:00.0 Off |                    0 |
| 26%   24C    P8     9W / 250W |    177MiB / 22916MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     70106      G   /usr/bin/X                         62MiB |
|    1   N/A  N/A     70106      G   /usr/bin/X                         64MiB |
|    1   N/A  N/A     70164      G   /usr/bin/gnome-shell              109MiB |
+-----------------------------------------------------------------------------+
  • nvidia_icd.json
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}
  • Alternatively, tried with an absolute library_path:
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "/lib64/libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}
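For anyone debugging something similar: a quick way to check which ICD manifests the loader can actually see, and to pin it to a single one to rule out a stale or conflicting manifest. This is only a sketch; the search paths are the standard system locations, and the pinned path matches the manifest above (adjust both to your install).

```shell
# List the ICD manifests in the loader's standard system search paths.
manifests=""
for d in /etc/vulkan/icd.d /usr/share/vulkan/icd.d; do
    if [ -d "$d" ]; then
        found=$(ls "$d"/*.json 2>/dev/null || true)
        manifests="$manifests $found"
        echo "manifests in $d: $found"
    else
        echo "no manifest directory: $d"
    fi
done

# Pin the loader to exactly one manifest, then re-run e.g. vulkaninfo.
export VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json
echo "loader pinned to: $VK_ICD_FILENAMES"
```

Running vulkaninfo with VK_LOADER_DEBUG=all set should additionally print which manifests the loader opened and why an ICD was skipped, which helps tell a loader problem apart from a driver problem.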

Issues with installed system libraries

Because the system libraries did not work, we also tried the SDK and ran into the following issues:

Issues with vulkansdk source build (with and without system libraries installed):

Attempts to fix the issue and get more information:

We’re at our wits’ end here; any input on what else we could try, or on what we might have missed in the logs or the steps above, would be greatly appreciated. We also created an issue on the LunarG website (https://vulkan.lunarg.com/issue/view/5fa3ffd35df112a7567973f4).

Thank you all in advance for any help!
David Gilbert

It turns out that the compute mode had been set to Exclusive_Process (via nvidia-smi -c 3). Changing it back to Default (nvidia-smi -c 0) fixed the issue.
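For anyone hitting the same thing, the check and the fix look roughly like this (a sketch; the -c change needs root, and 0/3 are the nvidia-smi codes for Default and Exclusive_Process):

```shell
# Show the compute mode currently set on each GPU.
query="--query-gpu=index,compute_mode"
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi $query --format=csv
    # then, as root:  nvidia-smi -c 0    # 0 = Default, 3 = Exclusive_Process
else
    echo "nvidia-smi not available on this machine"
fi
```

Note that the nvidia-smi dump above already hinted at this: the "Compute M." column reads "E. Process" for both GPUs.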

It would be awesome if anyone here could shed some light on why this happens; it might also be worth mentioning this behavior somewhere in the driver documentation.