Quadro P5000 with CUDA 11.6: “no kernel image is available for execution on the device”

I am trying to use a Quadro P5000 with my application and am getting “no kernel image is available for execution on the device”. I tried to run this down as best I can, and I assume it has something to do with the CUDA compute capability. Hopefully someone can break this down, explain it to me, and suggest a possible resolution. My best guess is that CUDA 11.6 doesn’t have a binary for the P5000 and that the only solution is to build from source, but that’s just a guess.

Here is some related info; let me know if I can provide anything else.

Application logs

Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"info","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/gpu/locks.rs:173","msg":"GPU is available for FFT!"}
Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"debug","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/gpu/locks.rs:21","msg":"Acquiring GPU lock at \"/home/filecoin/tmp/GPU-4/bellman.gpu.lock\" ..."}
Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"debug","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/gpu/locks.rs:25","msg":"GPU lock acquired!"}
Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"info","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::gpu::program","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/gpu/program.rs:72","msg":"Using kernel on CUDA."}
Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"error","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::gpu::fft","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/gpu/fft.rs:141","msg":"Cannot initialize kernel for device 'Quadro P5000'! Error: GPU tools error: Cuda Error: \"no kernel image is available for execution on the device\""}
Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"debug","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::gpu::locks","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/gpu/locks.rs:32","msg":"GPU lock released!"}
Apr 22 14:57:13 PC2-02 lotus-worker[23409]: {"level":"warn","ts":"2022-04-22T14:57:13.002-0400","logger":"bellperson::domain","caller":"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/bellperson-0.18.2/src/domain.rs:527","msg":"Cannot instantiate GPU FFT kernel! Error: GPUError: No working GPUs found!"}

DeviceQuery

root@PC2-02:/usr/local/cuda/bin# /usr/local/cuda-11.6/extras/demo_suite/deviceQuery
/usr/local/cuda-11.6/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 6 CUDA Capable device(s)

Device 0: "NVIDIA RTX A5000"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24256 MBytes (25434587136 bytes)
  (64) Multiprocessors, (128) CUDA Cores/MP:     8192 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA RTX A5000"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24256 MBytes (25434587136 bytes)
  (64) Multiprocessors, (128) CUDA Cores/MP:     8192 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 33 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA RTX A5000"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24256 MBytes (25434587136 bytes)
  (64) Multiprocessors, (128) CUDA Cores/MP:     8192 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 65 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA RTX A5000"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 24256 MBytes (25434587136 bytes)
  (64) Multiprocessors, (128) CUDA Cores/MP:     8192 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 4: "Quadro P5000"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 16279 MBytes (17069375488 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1734 MHz (1.73 GHz)
  Memory Clock rate:                             4513 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 5: "Quadro P5000"
  CUDA Driver Version / Runtime Version          11.6 / 11.6
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 16279 MBytes (17069375488 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1734 MHz (1.73 GHz)
  Memory Clock rate:                             4513 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 225 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA RTX A5000 (GPU0) -> NVIDIA RTX A5000 (GPU1) : Yes
> Peer access from NVIDIA RTX A5000 (GPU0) -> NVIDIA RTX A5000 (GPU2) : Yes
> Peer access from NVIDIA RTX A5000 (GPU0) -> NVIDIA RTX A5000 (GPU3) : Yes
> Peer access from NVIDIA RTX A5000 (GPU0) -> Quadro P5000 (GPU4) : No
> Peer access from NVIDIA RTX A5000 (GPU0) -> Quadro P5000 (GPU5) : No
> Peer access from NVIDIA RTX A5000 (GPU1) -> NVIDIA RTX A5000 (GPU0) : Yes
> Peer access from NVIDIA RTX A5000 (GPU1) -> NVIDIA RTX A5000 (GPU2) : Yes
> Peer access from NVIDIA RTX A5000 (GPU1) -> NVIDIA RTX A5000 (GPU3) : Yes
> Peer access from NVIDIA RTX A5000 (GPU1) -> Quadro P5000 (GPU4) : No
> Peer access from NVIDIA RTX A5000 (GPU1) -> Quadro P5000 (GPU5) : No
> Peer access from NVIDIA RTX A5000 (GPU2) -> NVIDIA RTX A5000 (GPU0) : Yes
> Peer access from NVIDIA RTX A5000 (GPU2) -> NVIDIA RTX A5000 (GPU1) : Yes
> Peer access from NVIDIA RTX A5000 (GPU2) -> NVIDIA RTX A5000 (GPU3) : Yes
> Peer access from NVIDIA RTX A5000 (GPU2) -> Quadro P5000 (GPU4) : No
> Peer access from NVIDIA RTX A5000 (GPU2) -> Quadro P5000 (GPU5) : No
> Peer access from NVIDIA RTX A5000 (GPU3) -> NVIDIA RTX A5000 (GPU0) : Yes
> Peer access from NVIDIA RTX A5000 (GPU3) -> NVIDIA RTX A5000 (GPU1) : Yes
> Peer access from NVIDIA RTX A5000 (GPU3) -> NVIDIA RTX A5000 (GPU2) : Yes
> Peer access from NVIDIA RTX A5000 (GPU3) -> Quadro P5000 (GPU4) : No
> Peer access from NVIDIA RTX A5000 (GPU3) -> Quadro P5000 (GPU5) : No
> Peer access from Quadro P5000 (GPU4) -> NVIDIA RTX A5000 (GPU0) : No
> Peer access from Quadro P5000 (GPU4) -> NVIDIA RTX A5000 (GPU1) : No
> Peer access from Quadro P5000 (GPU4) -> NVIDIA RTX A5000 (GPU2) : No
> Peer access from Quadro P5000 (GPU4) -> NVIDIA RTX A5000 (GPU3) : No
> Peer access from Quadro P5000 (GPU4) -> Quadro P5000 (GPU5) : Yes
> Peer access from Quadro P5000 (GPU5) -> NVIDIA RTX A5000 (GPU0) : No
> Peer access from Quadro P5000 (GPU5) -> NVIDIA RTX A5000 (GPU1) : No
> Peer access from Quadro P5000 (GPU5) -> NVIDIA RTX A5000 (GPU2) : No
> Peer access from Quadro P5000 (GPU5) -> NVIDIA RTX A5000 (GPU3) : No
> Peer access from Quadro P5000 (GPU5) -> Quadro P5000 (GPU4) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.6, NumDevs = 6, Device0 = NVIDIA RTX A5000, Device1 = NVIDIA RTX A5000, Device2 = NVIDIA RTX A5000, Device3 = NVIDIA RTX A5000, Device4 = Quadro P5000, Device5 = Quadro P5000
Result = PASS

SMI

root@PC2-02:/usr/local/cuda/bin# nvidia-smi 
Fri Apr 22 15:29:41 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:01:00.0 Off |                  Off |
| 30%   29C    P8    17W / 230W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000    On   | 00000000:21:00.0 Off |                  Off |
| 30%   31C    P8    19W / 230W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000    On   | 00000000:41:00.0 Off |                  Off |
| 30%   29C    P8    15W / 230W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000    On   | 00000000:61:00.0 Off |                  Off |
| 30%   29C    P8    15W / 230W |      1MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Quadro P5000        On   | 00000000:81:00.0 Off |                  Off |
| 26%   28C    P8     5W / 180W |    107MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Quadro P5000        On   | 00000000:E1:00.0 Off |                  Off |
| 26%   28C    P8     6W / 180W |      1MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A     23910      C   /usr/local/bin/lotus-worker       103MiB |
+-----------------------------------------------------------------------------+

The message means that the application you’re trying to run (which one is it?) does not contain CUDA kernels suitable for your GPU, and JIT compilation can’t be used either. The application has to be recompiled with both compute capability 8.6 and 6.1 enabled so that it works on all of your GPUs. As a workaround, you can use CUDA_VISIBLE_DEVICES to hide the P5000s and check whether the application contains a CUDA kernel for the A5000.
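To make the compatibility rule concrete, here is a small Python sketch of the check the CUDA runtime effectively performs. It is not taken from the application or from any CUDA API; the names `Target` and `can_run` are made up for illustration, and the `build` list is a hypothetical guess at what the binary was compiled for (I don't know the real build flags). The rule it encodes is the usual one: a cubin built for sm_XY runs only on devices with the same major version and an equal or higher minor version, while embedded PTX for compute_XY can be JIT-compiled on any device with compute capability X.Y or higher. The device list is copied from the deviceQuery output in the question.

```python
from typing import Dict, List, NamedTuple, Tuple


class Target(NamedTuple):
    """One architecture embedded in the binary (hypothetical helper type)."""
    major: int
    minor: int
    is_ptx: bool  # True for compute_XY (PTX), False for sm_XY (cubin)


def can_run(device_cc: Tuple[int, int], targets: List[Target]) -> bool:
    """Return True if at least one embedded target is usable on the device."""
    d_major, d_minor = device_cc
    for t in targets:
        if t.is_ptx:
            # PTX can be JIT-compiled for any device of equal or higher capability.
            if (t.major, t.minor) <= (d_major, d_minor):
                return True
        else:
            # A cubin needs the same major version and minor <= device minor.
            if t.major == d_major and t.minor <= d_minor:
                return True
    return False


# Compute capabilities as reported by deviceQuery in the question.
devices: Dict[int, Tuple[int, int]] = {
    0: (8, 6), 1: (8, 6), 2: (8, 6), 3: (8, 6),  # NVIDIA RTX A5000
    4: (6, 1), 5: (6, 1),                        # Quadro P5000
}

# ASSUMPTION: a build that only contains an sm_86 cubin and no PTX.
build = [Target(8, 6, is_ptx=False)]

usable = [idx for idx, cc in devices.items() if can_run(cc, build)]
visible = ",".join(str(idx) for idx in usable)
print(visible)  # 0,1,2,3
```

Under that assumed build, the resulting string is what you would put in CUDA_VISIBLE_DEVICES to hide the P5000s for the workaround test. If the build also contained an sm_61 cubin, i.e. `Target(6, 1, is_ptx=False)`, devices 4 and 5 would be usable as well, which is the point of recompiling for both 8.6 and 6.1.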