I recently upgraded the NVIDIA driver from 515 to 535 (I tried 525 as well) and noticed a bug while running clinfo and one of our applications that uses the new CUDA 12. Both clinfo and the application run inside a Docker container. The issues do not occur when running natively.
This is the relevant clinfo output:
Platform Name NVIDIA CUDA
Number of devices 1
Device Name NVIDIA RTX A4000
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 3.0 CUDA
Device UUID 6fab8f10-ccd0-3111-3f1a-e8ea712c3184
Driver UUID 6fab8f10-ccd0-3111-3f1a-e8ea712c3184
Valid Device LUID No
Device LUID 0000-4000636c5f6b
Device Node Mask 0
Device Numeric Version 0xc00000 (3.0.0)
Driver Version 535.86.05
Device OpenCL C Version OpenCL C 1.2
Device OpenCL C all versions OpenCL C 0x400000 (1.0.0)
OpenCL C 0x401000 (1.1.0)
OpenCL C 0x402000 (1.2.0)
OpenCL C 0xc00000 (3.0.0)
Device OpenCL C features __opencl_c_fp64 0xc00000 (3.0.0)
__opencl_c_images 0xc00000 (3.0.0)
__opencl_c_int64 0xc00000 (3.0.0)
__opencl_c_3d_image_writes 0xc00000 (3.0.0)
Latest comfornace test passed v2022-10-05-00
Device Type GPU
Device Topology (NV) PCI-E, 0000:00:00.4
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 48
Max clock frequency 1560MHz
Compute Capability (NV) 8.6
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple (device) 32
=== CL_PROGRAM_BUILD_LOG ===
Preferred work group size multiple (kernel) <getWGsizes:1504: create kernel : error -45>
Warp size (NV) 32
...
The error -45 is not present with NVIDIA driver 515 or older; changing only the driver version, with everything else held constant, triggers it. I believe this error is related to the failure our application hits when building its OpenCL program (Failed to compile OpenCL program - Error: -11). When the application runs, the program build log is empty.
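For reference, both numbers are standard OpenCL error codes from the Khronos CL/cl.h header: -45 is CL_INVALID_PROGRAM_EXECUTABLE, which clCreateKernel returns when the program has no successfully built executable, and -11 is CL_BUILD_PROGRAM_FAILURE. That fits the guess that both symptoms come from the program build step failing. The snippet below is only a small lookup sketch for decoding such codes; the table is a hand-picked subset and the helper name describe_cl_error is made up for illustration.

```python
# Subset of the error codes defined in the Khronos CL/cl.h header.
# Only a few entries relevant to program building are listed here.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -44: "CL_INVALID_PROGRAM",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
    -46: "CL_INVALID_KERNEL_NAME",
}


def describe_cl_error(code: int) -> str:
    """Return the symbolic name for an OpenCL error code (hypothetical helper)."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error {code}")


if __name__ == "__main__":
    # The two codes seen in clinfo and in the application.
    for code in (-45, -11):
        print(code, describe_cl_error(code))
```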
I have tried the different OpenCL packages available (cuda-opencl from the NVIDIA repo and libOpenCL1 from the openSUSE main repo) and still see the same output. libnvidia-opencl.so differs between the two driver installations, so I am not sure whether the issue lies there. Any help is greatly appreciated.
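One check that can be run both natively and inside the container is to see which vendor libraries the OpenCL ICD loader is pointed at and whether they load at all. The sketch below assumes the usual Linux layout, where the loader reads *.icd files from /etc/OpenCL/vendors and each file names a library. The function names are hypothetical. It only tests that the library can be opened; it says nothing about whether the driver's compiler works.

```python
import ctypes
from pathlib import Path


def read_icd_entries(vendors_dir="/etc/OpenCL/vendors"):
    """Return (icd file name, library named in it) pairs, sorted by file name."""
    entries = []
    for icd in sorted(Path(vendors_dir).glob("*.icd")):
        library = icd.read_text().strip()
        if library:
            entries.append((icd.name, library))
    return entries


def can_load(library):
    """Try to dlopen the library; True if the dynamic linker can load it."""
    try:
        ctypes.CDLL(library)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Run this natively and inside the container, then compare the output.
    for icd_name, library in read_icd_entries():
        status = "loads" if can_load(library) else "FAILS to load"
        print(f"{icd_name}: {library} {status}")
```

If the output differs between the host and the container, that would suggest the container is missing part of the driver's userspace libraries rather than the OpenCL package itself being at fault.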