bandwidthTest example throws cudaErrorCallRequiresNewerDriver error when launched via nv-nsight-cu-cli

Consider the bandwidthTest example from CUDA samples. It works as expected when compiled and launched normally.

$ /usr/local/cuda-12.3/bin/nvcc bandwidthTest.cu -o bandwidthTest

$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla P40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     11.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     283.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

However, it doesn’t work under Nsight Compute. Note that I have to use a standalone install of Nsight Compute 2019.5 because it’s the last version that supports Tesla P40 GPUs.

$ /usr/local/NVIDIA-Nsight-Compute-2019.5/nv-nsight-cu-cli ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

==PROF== Connected to process 8325 (/data/research/cuda-playground/polar/bandwidthTest)
cudaGetDeviceProperties returned 36
-> API call is not supported in the installed CUDA driver
CUDA error at bandwidthTest.cu:256 code=36(cudaErrorCallRequiresNewerDriver) "cudaSetDevice(currentDevice)"
==PROF== Disconnected from process 8325
==ERROR== The application returned an error code (1)
==WARNING== No kernels were profiled
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option

Some system information:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

$ sudo ubuntu-drivers debug
...... (verbose output omitted)
=== matching driver packages ===
nvidia-driver-525: installed: 525.147.05-0ubuntu0.22.04.1   available: 525.147.05-0ubuntu0.22.04.1 (auto-install)  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: PB
nvidia-driver-390: installed: <none>   available: 390.157-0ubuntu0.22.04.2  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: Legacy
nvidia-driver-545: installed: <none>   available: 545.23.08-0ubuntu1  [third party]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]
nvidia-driver-525-server: installed: <none>   available: 525.147.05-0ubuntu0.22.04.1  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: PB
nvidia-driver-535: installed: <none>   available: 535.86.10-0ubuntu1  [third party]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]
nvidia-driver-450-server: installed: <none>   available: 450.248.02-0ubuntu0.22.04.1  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: LTSB
nvidia-driver-470: installed: <none>   available: 470.223.02-0ubuntu0.22.04.1  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: LTSB
nvidia-driver-470-server: installed: <none>   available: 470.223.02-0ubuntu0.22.04.1  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: LTSB
nvidia-driver-418-server: installed: <none>   available: 418.226.00-0ubuntu5~0.22.04.1  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: LTSB
nvidia-driver-535-server: installed: <none>   available: 535.129.03-0ubuntu0.22.04.1  [distro]  non-free  modalias: pci:v000010DEd00001B38sv000010DEsd000011D9bc03sc02i00  path: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0  vendor: NVIDIA Corporation  model: GP102GL [Tesla P40]  support level: PB

$ nvidia-smi
Tue Jan  9 18:18:57 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000000:04:00.0 Off |                  Off |
| N/A   13C    P8     8W / 250W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           On   | 00000000:42:00.0 Off |                  Off |
| N/A   14C    P8     8W / 250W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  525.147.05  Wed Oct 25 20:27:35 UTC 2023
GCC version:  gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla P40"
  CUDA Driver Version / Runtime Version          12.0 / 12.3
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 24446 MBytes (25632964608 bytes)
  (030) Multiprocessors, (128) CUDA Cores/MP:    3840 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             3615 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla P40"
  CUDA Driver Version / Runtime Version          12.0 / 12.3
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 24446 MBytes (25632964608 bytes)
  (030) Multiprocessors, (128) CUDA Cores/MP:    3840 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             3615 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 66 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla P40 (GPU0) -> Tesla P40 (GPU1) : No
> Peer access from Tesla P40 (GPU1) -> Tesla P40 (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 12.3, NumDevs = 2
Result = PASS

nvidia-bug-report.log.gz (545.9 KB)

How do I fix this so that I can instrument my code with nv-nsight-cu-cli?

The only thing I notice is that you compiled with nvcc from CUDA Toolkit 12.3 but are running driver 525, which only supports CUDA 12.0. Maybe this has an influence on profiling? Please check whether upgrading to driver 545 helps.
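
To make the mismatch concrete, here is a rough sketch of how the two versions compare. CUDA reports versions as integers of the form 1000 * major + 10 * minor, so the "12.0 / 12.3" line from deviceQuery would correspond to 12000 and 12030. The function names below are made up for illustration, and the two integers are assumed from the deviceQuery output rather than read from the driver. A newer runtime is only a hint, not proof of the cause, since the sample runs fine outside the profiler.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    major, rest = divmod(encoded, 1000)
    return major, rest // 10


def runtime_newer_than_driver(driver: int, runtime: int) -> bool:
    """Return True when the runtime version is strictly newer than what the driver supports."""
    return decode_cuda_version(runtime) > decode_cuda_version(driver)


if __name__ == "__main__":
    # Assumed values: deviceQuery printed "12.0 / 12.3" for driver / runtime.
    driver_version = 12000
    runtime_version = 12030
    print("driver :", decode_cuda_version(driver_version))
    print("runtime:", decode_cuda_version(runtime_version))
    print("runtime newer than driver:",
          runtime_newer_than_driver(driver_version, runtime_version))
```

In a real program the two integers would come from cudaDriverGetVersion and cudaRuntimeGetVersion instead of being hard-coded.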

Nope, I’m still getting the same error after upgrading to the latest driver.
nvidia-bug-report.log.gz (572.8 KB)

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  545.23.08  Mon Nov  6 23:49:37 UTC 2023
GCC version:  gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)

Then maybe also ask in the Nsight Compute forum:
https://forums.developer.nvidia.com/c/developer-tools/nsight-compute/114

Thanks! Cross-posted here: bandwidthTest example throws cudaErrorCallRequiresNewerDriver error when launched via nv-nsight-cu-cli