nvprof works but Nsight Compute gives "No kernels were profiled" warning

I have a Titan V (Volta) GPU.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:C1:00.0 Off |                  N/A |
| 34%   45C    P8    N/A /  N/A |      2MiB /  1999MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN V      Off  | 00000000:E1:00.0 Off |                  N/A |
| 32%   47C    P8    28W / 250W |      4MiB / 12066MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I am trying to profile some code. For the sake of brevity, let's consider a simple vector addition program that I picked up here: https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
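
For reference, here is a minimal sketch of the kind of program being profiled. It is not the exact tutorial source: the kernel signature vecAdd(double*, double*, double*, int) is taken from the nvprof output, while the problem size n, the block size, and the variable names (h_a, d_a, and so on) are assumptions made for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>

// Element-wise c = a + b, one thread per element.
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)  // guard the last, partially filled block
        c[id] = a[id] + b[id];
}

int main()
{
    const int n = 100000;  // assumed problem size
    const size_t bytes = n * sizeof(double);

    // Host buffers
    double *h_a = (double *)malloc(bytes);
    double *h_b = (double *)malloc(bytes);
    double *h_c = (double *)malloc(bytes);

    // sin^2(i) + cos^2(i) = 1, so every element of c should be 1
    for (int i = 0; i < n; i++) {
        h_a[i] = sin((double)i) * sin((double)i);
        h_b[i] = cos((double)i) * cos((double)i);
    }

    // Device buffers
    double *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Two host-to-device copies for the inputs
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    const int blockSize = 1024;                            // assumed
    const int gridSize = (n + blockSize - 1) / blockSize;  // round up
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // One device-to-host copy for the result (also waits for the kernel)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Average of the result vector
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum / n);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
```

Compiled with something like nvcc vecAdd.cu (the file name is a placeholder), this yields the a.out used in the commands that follow.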

nvprof works fine:

nvprof --devices 1 ./a.out
==208985== NVPROF is profiling process 208985, command: ./a.out
final result: 1.000000
==208985== Profiling application: ./a.out
==208985== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   68.54%  145.57us         2  72.784us  68.160us  77.408us  [CUDA memcpy HtoD]
                   29.17%  61.952us         1  61.952us  61.952us  61.952us  [CUDA memcpy DtoH]
                    2.29%  4.8640us         1  4.8640us  4.8640us  4.8640us  vecAdd(double*, double*, double*, int)
      API calls:   98.08%  211.47ms         3  70.491ms  11.650us  211.27ms  cudaMalloc
                    0.64%  1.3889ms         3  462.96us  14.429us  1.2802ms  cudaFree
                    0.42%  910.12us         3  303.37us  116.25us  629.10us  cudaMemcpy
                    0.41%  882.48us         2  441.24us  150.76us  731.72us  cuDeviceTotalMem
                    0.38%  817.99us       202  4.0490us     350ns  175.41us  cuDeviceGetAttribute
                    0.04%  96.391us         2  48.195us  36.210us  60.181us  cuDeviceGetName
                    0.01%  29.326us         1  29.326us  29.326us  29.326us  cudaLaunchKernel
                    0.01%  12.221us         2  6.1100us  2.8180us  9.4030us  cuDeviceGetPCIBusId
                    0.00%  3.3020us         4     825ns     328ns  1.8740us  cuDeviceGet
                    0.00%  3.2020us         3  1.0670us     558ns  2.0160us  cuDeviceGetCount
                    0.00%  1.1250us         2     562ns     468ns     657ns  cuDeviceGetUuid

ncu, however, does not detect any kernels:

ncu --devices 1 ./a.out
==PROF== Connected to process 209016 (/home/adityap/a.out)
final result: 1.000000
==PROF== Disconnected from process 209016
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I tried installing older CUDA versions. Since I am on Debian, the Ubuntu packages seem to be broken, and the runfile installer fails at the driver step. If I install only the CUDA toolkit (without the driver), the installation succeeds, but ncu still doesn't work.

The same GPU worked fine on another machine; the issue only occurred after we moved it to this one. So my guess is that only some specific combination of CUDA toolkit and driver versions works for this device. Does anyone know which one?