ERR_NVGPUCTRPERM fixes failing for non-admin users

Dear All,

We would like to make GPU profiling available to all of the researchers on our high-performance computing cluster. When we attempt to run:

ncu --target-processes all -o profile python3 code.py

we see:
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
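Before changing any modprobe options, it can help to confirm that the driver-side restriction is actually active. A minimal check, assuming the driver exposes the restriction as RmProfilingAdminOnly in /proc/driver/nvidia/params (a value of 1 means profiling is restricted to admin users):

```shell
# Show the current profiling restriction as reported by the loaded driver.
# RmProfilingAdminOnly=1 means non-admin users will hit ERR_NVGPUCTRPERM.
grep RmProfilingAdminOnly /proc/driver/nvidia/params
```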

Following those instructions, we created a conf file in /etc/modprobe.d:

[root@X ~]# cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0
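One thing worth noting (a suggestion, not something confirmed by this thread): on EL8 systems the early-boot image carries its own copy of the modprobe configuration, so changes under /etc/modprobe.d may also need the initramfs regenerated to take effect consistently at boot:

```shell
# Rebuild the initramfs for the running kernel so that new or changed
# /etc/modprobe.d/*.conf files are picked up during early boot.
dracut -f
```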

However, upon reboot:

[root@X ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

/var/log/messages says:

Dec  4 11:35:09 holygpu7c1309 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: The NVIDIA probe routine was not called for 4 device(s).
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: This can occur when a driver such as:
NVRM: nouveau, rivafb, nvidiafb or rivatv
NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: No NVIDIA devices probed.
Dec  4 11:35:09 holygpu7c1309 kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 234
Dec  4 11:35:09 holygpu7c1309 systemd-udevd[4770]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Dec  4 11:35:09 holygpu7c1309 systemd-udevd[4775]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Dec  4 11:35:09 holygpu7c1309 systemd-udevd[4771]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.

This is Rocky Linux release 8.7 (Green Obsidian), and our drivers are:

[root@X modprobe.d]# rpm -qa | grep nvidia
nvidia-xconfig-535.104.12-1.el8.x86_64
nvidia-libXNVCtrl-devel-535.104.12-1.el8.x86_64
nvidia-driver-cuda-535.104.12-1.el8.x86_64
nvidia-libXNVCtrl-535.104.12-1.el8.x86_64
nvidia-kmod-common-535.104.12-1.el8.noarch
nvidia-driver-devel-535.104.12-1.el8.x86_64
nvidia-persistenced-535.104.12-1.el8.x86_64
nvidia-driver-cuda-libs-535.104.12-1.el8.x86_64
nvidia-settings-535.104.12-1.el8.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-535.104.12-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-535.104.12-1.el8.x86_64
nvidia-driver-NVML-535.104.12-1.el8.x86_64
kmod-nvidia-latest-dkms-535.104.12-1.el8.x86_64
nvidia-driver-libs-535.104.12-1.el8.x86_64
nvidia-modprobe-535.104.12-1.el8.x86_64

How can I check whether there is a kernel module conflict? Is there another solution to this problem?
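(For anyone hitting the same NVRM message: a few standard diagnostics can show whether another module has claimed the GPUs. This is a general sketch, not specific to this machine.)

```shell
# Which kernel driver currently owns each NVIDIA PCI device
# (look for the "Kernel driver in use:" line).
lspci -nnk | grep -A3 -i nvidia

# Is nouveau, or another framebuffer driver named in the NVRM message, loaded?
lsmod | grep -E 'nouveau|nvidiafb|rivafb'

# The kernel command line often carries the nouveau blacklist set up
# by the NVIDIA driver installation.
cat /proc/cmdline
```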

Thank you!

Hi @jrwells,

Sorry for the trouble you are seeing.
Could you uninstall the driver and reinstall it, to see whether there is any conflict?

Also, can you confirm that the issue is caused by the conf file you added? That is, if you delete the conf file and reboot, does the driver work again?

Hi, I’m working with Jason on this. We see the problem even with a driver rebuilt from the latest release, 545.23.08. It’s definitely that flag.

FYI: we have tried both of the following, and both still cause the crash:

options nvidia "NVreg_RestrictProfilingToAdminUsers=0"
options nvidia NVreg_RestrictProfilingToAdminUsers=0

We figured out the solution. Once we set that option, the nouveau driver was taking over the devices; apparently adding the conf file re-enabled it. I would suggest reworking the nvidia driver so that it simply squashes the nouveau driver in all cases.

Thanks for letting us know that you figured out the cause!