ERR_NVGPUCTRPERM fixes failing for non-admin users

Dear All,

We would like to make GPU profiling available to all of the researchers on our high-performance computing cluster. When we attempt to run:

ncu --target-processes all -o profile python3 code.py

we see:
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
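Before changing any modprobe options, it can help to confirm that the driver-side restriction is actually active. A minimal check, assuming the driver exposes the restriction as RmProfilingAdminOnly in /proc/driver/nvidia/params (a value of 1 means profiling is restricted to admin users):

```shell
# Show the current profiling restriction as reported by the loaded driver.
# RmProfilingAdminOnly=1 means non-admin users will hit ERR_NVGPUCTRPERM.
grep RmProfilingAdminOnly /proc/driver/nvidia/params
```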

Following those instructions, we created a conf file in /etc/modprobe.d:

[root@X ~]# cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0
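One thing worth noting (a suggestion, not something confirmed by this thread): on EL8 systems the early-boot image carries its own copy of the modprobe configuration, so changes under /etc/modprobe.d may also need the initramfs regenerated to take effect consistently at boot:

```shell
# Rebuild the initramfs for the running kernel so that new or changed
# /etc/modprobe.d/*.conf files are picked up during early boot.
dracut -f
```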

However, upon reboot:

[root@X ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

/var/log/messages says:

Dec  4 11:35:09 holygpu7c1309 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: The NVIDIA probe routine was not called for 4 device(s).
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: This can occur when a driver such as:
NVRM: nouveau, rivafb, nvidiafb or rivatv
NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
Dec  4 11:35:09 holygpu7c1309 kernel: NVRM: No NVIDIA devices probed.
Dec  4 11:35:09 holygpu7c1309 kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 234
Dec  4 11:35:09 holygpu7c1309 systemd-udevd[4770]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Dec  4 11:35:09 holygpu7c1309 systemd-udevd[4775]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Dec  4 11:35:09 holygpu7c1309 systemd-udevd[4771]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.

This is Rocky Linux release 8.7 (Green Obsidian), and our drivers are:

[root@X modprobe.d]# rpm -qa | grep nvidia
nvidia-xconfig-535.104.12-1.el8.x86_64
nvidia-libXNVCtrl-devel-535.104.12-1.el8.x86_64
nvidia-driver-cuda-535.104.12-1.el8.x86_64
nvidia-libXNVCtrl-535.104.12-1.el8.x86_64
nvidia-kmod-common-535.104.12-1.el8.noarch
nvidia-driver-devel-535.104.12-1.el8.x86_64
nvidia-persistenced-535.104.12-1.el8.x86_64
nvidia-driver-cuda-libs-535.104.12-1.el8.x86_64
nvidia-settings-535.104.12-1.el8.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-535.104.12-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-535.104.12-1.el8.x86_64
nvidia-driver-NVML-535.104.12-1.el8.x86_64
kmod-nvidia-latest-dkms-535.104.12-1.el8.x86_64
nvidia-driver-libs-535.104.12-1.el8.x86_64
nvidia-modprobe-535.104.12-1.el8.x86_64

How can I check whether there is a kernel module conflict? Is there another solution to this problem?
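(For anyone hitting the same NVRM message: a few standard diagnostics can show whether another module has claimed the GPUs. This is a general sketch, not specific to this machine.)

```shell
# Which kernel driver currently owns each NVIDIA PCI device
# (look for the "Kernel driver in use:" line).
lspci -nnk | grep -A3 -i nvidia

# Is nouveau, or another framebuffer driver named in the NVRM message, loaded?
lsmod | grep -E 'nouveau|nvidiafb|rivafb'

# The kernel command line often carries the nouveau blacklist set up
# by the NVIDIA driver installation.
cat /proc/cmdline
```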

Thank you!

Hi @jrwells,

Sorry for the trouble you are seeing.
Could you uninstall the driver and reinstall it, to see whether there is any conflict?

Also, can you confirm that the issue is caused by the conf file you added? That is, if you delete the conf file and reboot, does the driver work again?

Hi, I’m working with Jason on this. We see the problem even with a driver rebuilt from the latest release, 545.23.08. It’s definitely that flag.

FYI: we have tried both of the following, and both still cause the crash:

options nvidia "NVreg_RestrictProfilingToAdminUsers=0"
options nvidia NVreg_RestrictProfilingToAdminUsers=0

We figured out the solution. Once we set that option, the nouveau driver was taking over the devices; apparently adding the conf file re-enabled it. I would suggest reworking the nvidia driver so that it simply squashes the nouveau driver in all cases.

Thanks for letting us know that you figured out the cause!