Hello everyone,
We’re trying to get Vulkan (and the SDK) running on the GPU cluster that powers our multi-projector CAVE, but we can’t get it to work even on a single node of the cluster.
The main issue is that every call to vkCreateDevice, regardless of the calling application, fails with ERROR_INITIALIZATION_FAILED. We tried all of the steps below on two clusters with different hardware:
Hardware
- Cluster 1: 2x Quadro P6000
- Cluster 2: 1x GTX 780ti
- Shared cluster filesystem, but the SDK was explicitly tested on the local filesystem of a single node with a regular monitor attached to one GPU.
Software
- CentOS 7.8
- Packages:
  - vulkan.x86_64 (1.1.97.0-1.el7)
  - vulkan-devel.x86_64 (1.1.97.0-1.el7)
  - vulkan-filesystem.noarch (1.1.97.0-1.el7)
- gcc 4.8.5 (CentOS default, unloaded)
- gcc 7.3.0 (loaded via module load gcc/7)
- Nvidia Unix Driver 450.80.02 [tested with various other versions as well]
- nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:3B:00.0 Off |                    0 |
| 26%   18C    P8     9W / 250W |     65MiB / 22916MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P6000        Off  | 00000000:86:00.0 Off |                    0 |
| 26%   24C    P8     9W / 250W |    177MiB / 22916MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     70106      G   /usr/bin/X                         62MiB |
|    1   N/A  N/A     70106      G   /usr/bin/X                         64MiB |
|    1   N/A  N/A     70164      G   /usr/bin/gnome-shell              109MiB |
+-----------------------------------------------------------------------------+
- nvidia_icd.json
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}
- Alternatively tried with
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "/lib64/libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}
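A note on the nvidia-smi table above: its "Compute M." column reads "E. Process" (exclusive process) for both GPUs. Whether that has any bearing on the vkCreateDevice failure is not established in this post, but the setting is easy to list per GPU. The sketch below assumes nvidia-smi's --query-gpu CSV output with the fields index, name and compute_mode; the helper name non_default_compute_mode is made up for illustration.

```shell
#!/bin/sh
# Sketch: list GPUs whose compute mode is not "Default".
# Reads CSV rows of the form "index, name, compute_mode" on stdin, e.g. from
#   nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader
# non_default_compute_mode is a hypothetical helper, not an NVIDIA tool.
non_default_compute_mode() {
    # Fields are separated by a comma followed by a space.
    awk -F', ' '$3 != "Default" { print "GPU " $1 " (" $2 "): " $3 }'
}

# Usage on a node with the driver installed:
#   nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader \
#       | non_default_compute_mode
```

If the mode is suspected to matter, running nvidia-smi -c DEFAULT as root switches a GPU back to the default compute mode. That is something to test, not a confirmed fix.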
Issues with installed system libraries
- Calling vulkaninfo works and produces the following output: https://vulkan.lunarg.com/issue/file/5fa3ffd35df112a7567973f4/upload/1604583441_vulkaninfo_system.log
- However, trying to start any other Vulkan application fails to create the device, e.g. using a different vulkaninfo from the SDK:
ERROR at /tmp/vulkan/1.2.154.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:1515:vkCreateDevice failed with ERROR_INITIALIZATION_FAILED
- Any other application also fails (vkcube, hologram, Unreal Engine)
Because the system libraries did not work, we tried it with the SDK and got the following issues:
Issues with vulkansdk source build (with and without system libraries installed):
- Dependencies all install well
- ./vulkansdk all runs through flawlessly (with gcc 7)
- source setup-env.sh works and sets the correct paths
- Pre-built /tmp/vulkan/1.2.154.0/x86_64/bin/vulkaninfo fails at vkCreateDevice (https://vulkan.lunarg.com/issue/file/5fa3ffd35df112a7567973f4/upload/1604583439_vulkaninfo_sdk.log)
- Source-built vulkaninfo fails in exactly the same way
- Source and pre-built vkcube and API-Samples > 02 also fail.
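For anyone reproducing the SDK steps above, "sets the correct paths" can be checked from a script. This is only a sketch: it assumes that setup-env.sh exports VULKAN_SDK and prepends $VULKAN_SDK/bin to PATH and $VULKAN_SDK/lib to LD_LIBRARY_PATH, which should be verified against the script shipped with the SDK version in use. The function name sdk_env_ok is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that the SDK directories are on the search paths.
# Assumption: setup-env.sh exports VULKAN_SDK and adds $VULKAN_SDK/bin to PATH
# and $VULKAN_SDK/lib to LD_LIBRARY_PATH. sdk_env_ok is a hypothetical name.
sdk_env_ok() {
    [ -n "${VULKAN_SDK:-}" ] || { echo "VULKAN_SDK is not set"; return 1; }
    # Wrap each list in colons so that every entry can be matched as ":dir:".
    case ":${PATH}:" in
        *":${VULKAN_SDK}/bin:"*) ;;
        *) echo "PATH lacks ${VULKAN_SDK}/bin"; return 1 ;;
    esac
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${VULKAN_SDK}/lib:"*) ;;
        *) echo "LD_LIBRARY_PATH lacks ${VULKAN_SDK}/lib"; return 1 ;;
    esac
    echo "SDK paths look consistent: ${VULKAN_SDK}"
}

# Usage, in the same shell session that sourced setup-env.sh:
#   sdk_env_ok && command -v vulkaninfo
```

The final command -v shows which vulkaninfo binary comes first on PATH, which helps tell the system package apart from the SDK build.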
Attempts to fix the issue and get more information:
- vulkaninfo logs:
  - System package
  - SDK
  - SDK with validation layers turned on and an explicit ICD path to the NVIDIA driver:
    export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump:VK_LAYER_KHRONOS_validation
    export VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json
    https://vulkan.lunarg.com/issue/file/5fa3ffd35df112a7567973f4/upload/1604583442_vulkaninfo_validation_explicitNvidiaICD.log
  - SDK with the above and, additionally, all loader output enabled:
    export VK_LOADER_DEBUG=all
- Running vkvia for different configurations yields the following output, all of them failing at vkCreateDevice:
  - vkvia with SDK
  - vkvia with validation layers turned on and an explicit ICD path to the NVIDIA driver:
    export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump:VK_LAYER_KHRONOS_validation
    export VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json
    - LunarG VIA
  - vkvia with the above and, additionally, all loader output enabled:
    export VK_LOADER_DEBUG=all
    - LunarG VIA
- Additionally, we tried running strace -f to see if anything suspicious was happening there, but have found nothing interesting so far:
  - strace -f vulkaninfo with system libraries
  - strace -f vulkaninfo with SDK libraries
  - strace -f vulkaninfo with SDK libraries and all debug output set
- We also tried setting some other variables, but didn’t explicitly create logs for them as none of them changed anything:
  - __NV_PRIME_RENDER_OFFLOAD=1
  - __VK_LAYER_NV_optimus=NVIDIA_only
  - __GLX_VENDOR_LIBRARY_NAME=nvidia
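To make the debug runs in the list above repeatable, the environment settings can be wrapped in a small helper that applies them to a single command and keeps the combined output in a log file. This is a sketch only: the variable values are the ones quoted in this post, and the function name run_vk_debug and the log file name in the usage line are made up for illustration.

```shell
#!/bin/sh
# Sketch: run one command with the Vulkan debug settings quoted in the post
# and write its stdout and stderr to a log file.
# run_vk_debug is a hypothetical helper name.
run_vk_debug() {
    vk_log_file=$1
    shift
    # env applies the variables to this one command only, so the calling
    # shell stays clean for the next configuration.
    env \
        VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump:VK_LAYER_KHRONOS_validation \
        VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json \
        VK_LOADER_DEBUG=all \
        "$@" >"$vk_log_file" 2>&1
}

# Usage (log file name is arbitrary):
#   run_vk_debug vulkaninfo_debug.log vulkaninfo
#   echo "exit status: $?"
```

The helper returns the exit status of the wrapped command, so a failing vkCreateDevice still shows up as a non-zero status in the caller.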
We’re at our wits’ end here. Any input on what else we could try, or on what we might have missed in the logs or in the steps we took, would be greatly appreciated. We also created an issue on the LunarG website (LunarXchange).
Thank you all in advance for any help!
David Gilbert