Nsight Compute cannot access GPU performance counters

Ubuntu 22.04 Server + GUI desktop
CUDA 12.1

Running ncu-ui I get this error when profiling:

Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

The two suggestions on the linked page were to run with elevated privileges or enable access permanently.

  1. Run with elevated privileges:

…Trying sudo

sscott@demo:~$ sudo ncu-ui
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
Cannot mix incompatible Qt library (5.15.3) with this library (5.15.2)
sscott@demo:~$ sudo -E ncu-ui
QStandardPaths: runtime directory '/run/user/1001' is not owned by UID 0, but a directory permissions 0700 owned by UID 1001 GID 1001
QStandardPaths: runtime directory '/run/user/1001' is not owned by UID 0, but a directory permissions 0700 owned by UID 1001 GID 1001
QStandardPaths: runtime directory '/run/user/1001' is not owned by UID 0, but a directory permissions 0700 owned by UID 1001 GID 1001
Cannot mix incompatible Qt library (5.15.3) with this library (5.15.2)

…Trying setcap on executable

sscott@demo:~/esat-rx$ sudo setcap 'cap_sys_admin=+ep' /opt/nvidia/nsight-compute/2023.1.0/host/linux-desktop-glibc_2_11_3-x64/ncu-ui.bin
[sudo] password for sscott: 
sscott@demo:~/esat-rx$ getcap /opt/nvidia/nsight-compute/2023.1.0/host/linux-desktop-glibc_2_11_3-x64/ncu-ui.bin
/opt/nvidia/nsight-compute/2023.1.0/host/linux-desktop-glibc_2_11_3-x64/ncu-ui.bin cap_sys_admin=ep
sscott@demo:~/esat-rx$ ncu-ui
/opt/nvidia/nsight-compute/2023.1.0/host/linux-desktop-glibc_2_11_3-x64/ncu-ui.bin: error while loading shared libraries: libAppLib.so: cannot open shared object file: No such file or directory

  2. Enable access permanently:
sscott@demo:~/esat-rx$ cat /etc/modprobe.d/nvidia-profiling.conf 
options nvidia "NVreg_RestrictProfilingToAdminUsers=0"

I then rebooted, but I still get the same error: ERR_NVGPUCTRPERM.

How do I get this to work?

For (2), assuming you copied the string from the website rather than re-typing it yourself, can you try replacing the " characters in the file? We have seen cases where copy-and-paste produces a character that looks the same but has a different encoding, which the kernel module then does not recognize.
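
For example, one quick way to check for any non-ASCII bytes in the file (just a suggestion, using the path you showed) is:

$ grep -nP '[^\x00-\x7F]' /etc/modprobe.d/nvidia-profiling.conf

If this prints nothing, every character in the file is plain ASCII, including the quotes.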

sscott@demo:~/esat-rx$ od -t x1 -c /etc/modprobe.d/nvidia-profiling.conf 
0000000  6f  70  74  69  6f  6e  73  20  6e  76  69  64  69  61  20  22
          o   p   t   i   o   n   s       n   v   i   d   i   a       "
0000020  4e  56  72  65  67  5f  52  65  73  74  72  69  63  74  50  72
          N   V   r   e   g   _   R   e   s   t   r   i   c   t   P   r
0000040  6f  66  69  6c  69  6e  67  54  6f  41  64  6d  69  6e  55  73
          o   f   i   l   i   n   g   T   o   A   d   m   i   n   U   s
0000060  65  72  73  3d  30  22  0a  0a
          e   r   s   =   0   "  \n  \n
0000070

If I understand the output properly, it seems the correct character is used in the conf file. I will have to get back to the team to see if we can reproduce the issue internally and check what might be wrong. Some things you could try on your end in the meantime:

  • Check the dmesg output for suspicious messages.
  • Try the ncu command line interface instead of the ncu-ui UI. You could try it as ncu <my-app> for a simple run; see the example right after this list.
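
To make the CLI suggestion concrete, a minimal run might look like the following (./my-app is a placeholder for any CUDA application you have; -o is the standard ncu option for writing a report file):

$ ncu ./my-app
$ ncu -o my-report ./my-app
# the second form writes my-report.ncu-rep, which can be opened in ncu-ui later

If the restriction is enforced by the driver, the CLI should fail with the same ERR_NVGPUCTRPERM, which would at least rule out a GUI-specific problem.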

I checked with our QA team, and they were not able to reproduce the issue on a similar setup. Can you confirm that you followed these installation instructions?

Other things to try:

  • Check whether the kernel module was loaded with the correct parameter by calling $ grep RmProfilingAdminOnly /proc/driver/nvidia/params. The expected output would be RmProfilingAdminOnly: 0
  • Did you rebuild the initial ramdisk with $ sudo update-initramfs -u -k all? (The sketch after this list shows the full sequence.)
  • You could filter the dmesg output down to relevant lines with $ journalctl --dmesg --boot --grep=nvidia.
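
Putting the permanent-access steps together, the full sequence would look roughly like this (a sketch only; it reuses the conf file name you already have, and any file under /etc/modprobe.d/ would work):

$ echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee /etc/modprobe.d/nvidia-profiling.conf
$ sudo update-initramfs -u -k all
$ sudo reboot
# after the reboot, verify that the module picked up the parameter:
$ grep RmProfilingAdminOnly /proc/driver/nvidia/params
RmProfilingAdminOnly: 0

If the nvidia module is loaded from the initramfs, skipping the update-initramfs step is a common reason the parameter does not take effect after a reboot.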

The problem seems to have resolved itself after power-cycling the server. I had rebooted the server before and still could not access the performance counters, but after a full power cycle the profiler can now read them. I don't have any other explanation for why it has started working.

Thanks for looking into it.